[00:00.000 --> 00:05.620]  I just want to double check this.
[00:06.280 --> 00:10.840]  Alright, cool. I have all my stuff muted and we are good to go.
[00:10.960 --> 00:15.020]  Hello and welcome to Machine Learning for Security Analysts.
[00:15.020 --> 00:18.640]  This is going to be our final workshop for the AI village.
[00:19.020 --> 00:21.760]  This is going to be a beginner and interactive workshop
[00:21.760 --> 00:25.260]  for all those who are interested in machine learning and security
[00:25.260 --> 00:28.700]  and what this whole crazy AI thing is.
[00:28.700 --> 00:31.980]  So really, the major motivation for this workshop
[00:31.980 --> 00:35.800]  was that we're seeing a lot of buzz in the security community
[00:35.800 --> 00:38.360]  around AI and machine learning.
[00:38.360 --> 00:42.640]  But for many of us, it's not really clear what that is.
[00:43.460 --> 00:46.620]  So... oh, sorry.
[00:46.620 --> 00:49.020]  If you want to participate and ask questions,
[00:49.020 --> 00:52.100]  head over to the Discord channel and join the
[00:52.100 --> 00:55.780]  AIVillage-General-Voice, AIV-General-Voice.
[00:55.960 --> 00:58.680]  And you'll be able to ask us questions.
[00:59.280 --> 01:02.800]  If there's too many people joining and it gets unorganized,
[01:02.800 --> 01:03.900]  we're going to switch to Zoom.
[01:03.900 --> 01:08.140]  But it's sort of more open than it was before.
[01:09.000 --> 01:11.920]  Sorry, Gavin. Just needed to say that.
[01:12.260 --> 01:14.500]  No worries. Last line of this statement is
[01:14.500 --> 01:16.600]  I wanted to take, for the security community,
[01:16.600 --> 01:19.400]  the idea of machine learning and AI out of the buzzword
[01:19.400 --> 01:20.720]  and into the mainstream.
[01:22.040 --> 01:25.520]  So, my name is Gavin. I go by the pseudonym GT Klondike.
[01:25.520 --> 01:28.680]  I'm an independent security researcher and a security consultant.
[01:28.700 --> 01:32.720]  I'm very passionate about network attack and defense.
[01:32.720 --> 01:35.400]  And through that passion, I run a project called NetSecExplain,
[01:35.400 --> 01:37.340]  which is a blog and a YouTube channel,
[01:37.340 --> 01:39.900]  where I explain intermediate and advanced level
[01:39.900 --> 01:43.180]  computer networking concepts and security concepts
[01:43.550 --> 01:45.800]  in an easy-to-understand way.
[01:47.420 --> 01:49.920]  So, before I get started, usually in person,
[01:49.920 --> 01:51.360]  this is where I ask people,
[01:51.360 --> 01:53.280]  what are your thoughts when you hear the word
[01:53.280 --> 01:55.380]  machine learning? Right?
[01:55.380 --> 01:57.680]  And I've gotten a range of answers in the past
[01:57.680 --> 02:01.480]  that range from, oh, Skynet's going to take over the world,
[02:01.480 --> 02:04.760]  to you need a PhD in multivariable calculus
[02:04.760 --> 02:06.620]  and linear algebra.
[02:07.700 --> 02:11.640]  And while there is a mild amount of truth
[02:11.640 --> 02:14.120]  to the math background that you should have,
[02:14.600 --> 02:16.520]  neither of those are really true.
[02:17.920 --> 02:21.220]  Instead, machine learning is really a form of pattern matching.
[02:21.220 --> 02:22.740]  It's applied statistics.
[02:22.740 --> 02:27.380]  And with the increase in power behind the computer memory
[02:27.680 --> 02:28.520]  and behind the CPU,
[02:28.520 --> 02:30.200]  we're able to use these statistics
[02:30.200 --> 02:34.380]  in a much quicker, more processed way.
[02:35.980 --> 02:37.620]  So, first I want to kind of cover
[02:37.620 --> 02:39.900]  this idea of machine learning,
[02:39.900 --> 02:41.920]  artificial intelligence, and deep learning.
[02:41.920 --> 02:45.120]  In a lot of areas, it's being used interchangeably,
[02:45.120 --> 02:46.960]  and that's not really correct.
[02:47.240 --> 02:49.440]  First, what we have is this large umbrella
[02:49.440 --> 02:51.620]  of artificial intelligence.
[02:51.620 --> 02:53.360]  And artificial intelligence is
[02:54.220 --> 02:58.340]  kind of a philosophy, in my opinion.
[02:58.340 --> 03:00.320]  Artificial intelligence used to be,
[03:00.320 --> 03:03.400]  when the first digital calculators came out,
[03:03.400 --> 03:05.540]  that was considered artificial intelligence.
[03:05.540 --> 03:07.380]  A computer can do math.
[03:07.400 --> 03:09.820]  As things became more advanced,
[03:09.820 --> 03:13.040]  different concepts were considered artificial intelligence.
[03:13.040 --> 03:15.000]  There's video game artificial intelligence,
[03:15.000 --> 03:16.820]  rule-based artificial intelligence,
[03:16.820 --> 03:18.760]  classical image recognition.
[03:20.200 --> 03:22.020]  But today we're talking about machine learning,
[03:22.020 --> 03:24.840]  which is a subset of artificial intelligence.
[03:24.840 --> 03:27.480]  And it's focused a lot more on
[03:28.020 --> 03:29.680]  grabbing inferences from data
[03:29.680 --> 03:32.320]  and making statistical patterns.
[03:33.680 --> 03:35.720]  And then inside of machine learning,
[03:35.720 --> 03:38.020]  we have a concept called deep learning.
[03:38.020 --> 03:41.520]  Deep learning is specifically deep neural networks.
[03:41.960 --> 03:44.680]  And so neural networks is a machine learning algorithm
[03:45.420 --> 03:47.180]  that's pretty popular.
[03:47.180 --> 03:48.500]  But deep learning allows us
[03:48.500 --> 03:50.200]  to take that a little bit further.
[03:50.200 --> 03:52.020]  And so usually we see this in the form
[03:52.020 --> 03:53.800]  of convolutional neural networks,
[03:53.800 --> 03:56.680]  which are primarily used for image recognition,
[03:56.680 --> 03:58.380]  or recurrent neural networks,
[03:58.380 --> 04:01.400]  which are used for cyclical pattern recognition.
[04:01.400 --> 04:05.880]  So think of things such as a heartbeat monitor,
[04:05.880 --> 04:07.400]  where there's a pattern involved
[04:08.180 --> 04:10.020]  that you kind of need to match up.
[04:10.020 --> 04:12.440]  And if there's any sort of heart irregularities,
[04:12.440 --> 04:16.320]  the pattern will diverge from predicted patterns.
[04:17.960 --> 04:22.020]  So if we wanted to formally define machine learning,
[04:22.020 --> 04:24.140]  I would say that it's a set of statistical techniques
[04:24.140 --> 04:26.600]  that enable the process of information mining,
[04:26.600 --> 04:29.480]  pattern discovery, and drawing inferences from data.
[04:29.540 --> 04:32.580]  So the idea is that the algorithms learn
[04:32.580 --> 04:34.820]  from past data to predict future outcomes,
[04:34.820 --> 04:37.740]  instead of creating a different algorithm
[04:37.740 --> 04:40.080]  for a different scenario.
[04:40.400 --> 04:44.580]  And we'll see that as we go through the workbooks,
[04:44.580 --> 04:47.140]  how we're able to create multiple different types
[04:47.140 --> 04:50.420]  of algorithms around the same set of data,
[04:50.420 --> 04:51.860]  or how we can use the same algorithms
[04:51.860 --> 04:53.640]  with different types of data.
[04:54.440 --> 04:56.120]  So some key examples,
[04:56.120 --> 04:57.540]  especially in the security sphere,
[04:57.540 --> 04:59.920]  is domain generation algorithms.
[04:59.920 --> 05:02.060]  Domain generation algorithms are heavily used
[05:02.060 --> 05:04.120]  by botnets so that the bot herder
[05:04.120 --> 05:07.760]  can stay in contact with the botnet itself,
[05:07.760 --> 05:11.000]  while government agencies or ISPs
[05:11.000 --> 05:12.840]  are taking down the domains
[05:12.840 --> 05:14.580]  that they use to communicate.
[05:14.940 --> 05:17.680]  We also see this with web application firewalls.
[05:17.680 --> 05:19.240]  We can perform anomaly detection
[05:19.240 --> 05:25.100]  to identify normal web traffic,
[05:25.100 --> 05:27.880]  normal parameters that are sent by a web client,
[05:27.880 --> 05:30.260]  versus potentially malicious parameters.
[05:30.260 --> 05:31.780]  In this example here,
[05:31.780 --> 05:35.180]  it's showing a possible SQL injection,
[05:35.180 --> 05:36.940]  and it's calling that out.
[05:37.220 --> 05:39.320]  And then, of course, network anomaly detection,
[05:39.320 --> 05:41.840]  where what we can do is predict
[05:43.300 --> 05:46.880]  how the network will operate over a period of time,
[05:46.880 --> 05:50.780]  and then keep track of how it is actually operating.
[05:50.780 --> 05:53.360]  And if it's beyond a certain delta or a difference,
[05:53.360 --> 05:55.240]  then we can flag that and send it off
[05:55.240 --> 05:56.740]  to a security analyst.
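The predicted-versus-actual check described above can be sketched in a few lines. This is only an illustration: the function name `flag_anomalies`, the sample traffic numbers, and the threshold of 25 are assumptions, not anything shown in the workshop.

```python
def flag_anomalies(predicted, observed, max_delta):
    """Return the indices where observed traffic strays from the prediction
    by more than max_delta (hypothetical helper for illustration)."""
    return [
        i
        for i, (p, o) in enumerate(zip(predicted, observed))
        if abs(o - p) > max_delta
    ]


# Made-up per-interval request counts: what we predicted vs. what we saw.
predicted = [100, 120, 110, 105]
observed = [102, 118, 190, 104]

# Only the third interval (index 2) is more than 25 away from its prediction.
flagged = flag_anomalies(predicted, observed, max_delta=25)
print(flagged)
```

Anything in `flagged` is what would be sent off to an analyst; how the prediction itself is produced is a separate modeling question.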
[05:58.040 --> 06:00.300]  So why is this important for security analysts,
[06:00.300 --> 06:02.200]  and also important for hackers?
[06:02.700 --> 06:05.460]  Well, today, over a quarter of security products
[06:05.460 --> 06:08.640]  for detection have some form of machine learning.
[06:08.800 --> 06:10.260]  So from an analyst's perspective,
[06:10.260 --> 06:11.600]  what you want to do is understand
[06:12.840 --> 06:13.220]  how these machines work,
[06:13.220 --> 06:15.440]  so that you can ensure that they operate efficiently
[06:16.580 --> 06:18.280]  and are working effectively.
[06:19.100 --> 06:21.360]  Machine learning isn't this magical black box
[06:21.360 --> 06:23.180]  that just solves all of our problems.
[06:23.920 --> 06:25.480]  And I want to share a story with you,
[06:25.480 --> 06:28.240]  and this is what really made the idea of machine learning
[06:28.240 --> 06:31.020]  and the value of machine learning really stick out to me.
[06:31.480 --> 06:32.520]  So a number of years ago,
[06:32.620 --> 06:35.420]  a friend of mine was approached by a non-profit.
[06:35.420 --> 06:37.200]  And what this non-profit did was track
[06:37.200 --> 06:39.180]  cheetah population out in Africa.
[06:39.180 --> 06:41.360]  So they set up a bunch of different cameras,
[06:42.840 --> 06:44.960]  and they would take a picture at different intervals,
[06:44.960 --> 06:46.620]  and it would be sent off to a human analyst
[06:46.620 --> 06:49.900]  to identify whether or not there's a cheetah in the picture.
[06:50.420 --> 06:52.020]  Not a job I would want to do,
[06:52.020 --> 06:54.880]  but it needs done, and it's really important.
[06:56.100 --> 06:57.220]  So they approached my friend,
[06:57.220 --> 06:58.400]  and they were saying, hey, is there a way
[06:58.400 --> 07:00.260]  that we can increase our efficiency?
[07:00.260 --> 07:02.680]  Something that we can use to kind of automate
[07:02.680 --> 07:04.780]  some of the manual labor that we're doing.
[07:05.180 --> 07:06.840]  And so what he did was spend about a week,
[07:06.840 --> 07:09.860]  and he used a very simple machine learning model
[07:11.300 --> 07:13.140]  to identify images.
[07:13.200 --> 07:15.840]  And what he told this model was,
[07:15.840 --> 07:17.440]  hey, if you see an image,
[07:17.440 --> 07:19.940]  and you are at least 5% confident
[07:19.940 --> 07:22.080]  that there's a cheetah in the picture,
[07:22.080 --> 07:23.740]  send it off to an analyst.
[07:23.740 --> 07:25.680]  If you're less than 5% confident,
[07:25.680 --> 07:26.920]  just throw it away.
[07:27.640 --> 07:29.640]  Now 5%, that doesn't really sound like a lot,
[07:29.640 --> 07:31.160]  and to be honest, it's not.
[07:31.320 --> 07:33.740]  But what this non-profit was able to do
[07:33.740 --> 07:35.860]  with this simple model was
[07:37.240 --> 07:39.680]  what would normally take them a year of labor
[07:39.860 --> 07:41.980]  they were able to complete in one month.
[07:42.620 --> 07:45.500]  That is a twelvefold increase in productivity
[07:45.500 --> 07:47.720]  by just implementing a little bit of machine learning
[07:47.720 --> 07:50.640]  and that fuzzy logic into their process
[07:50.640 --> 07:52.320]  to kind of help automate it.
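The triage rule in the cheetah story can be sketched as a simple threshold on the model's confidence. The 5% cutoff comes from the story; the function name `triage`, the file names, and the scores are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.05  # the 5% cutoff from the story


def triage(scored_images, threshold=CONFIDENCE_THRESHOLD):
    """Split (name, confidence) pairs into images for a human analyst
    and images that are discarded. Hypothetical helper for illustration."""
    to_analyst = [name for name, conf in scored_images if conf >= threshold]
    discarded = [name for name, conf in scored_images if conf < threshold]
    return to_analyst, discarded


# Made-up confidence scores from some image model.
scored = [("img_001.jpg", 0.92), ("img_002.jpg", 0.01), ("img_003.jpg", 0.07)]
to_analyst, discarded = triage(scored)
print(to_analyst, discarded)
```

A low threshold like this keeps almost every image that might contain a cheetah, and the savings come from the large number of images that fall below it.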
[07:52.860 --> 07:54.160]  So this is what I want to do
[07:54.160 --> 07:56.520]  to kind of push for the security analysts
[07:56.520 --> 07:57.600]  and for the hacker community
[07:58.100 --> 07:59.500]  to get a little bit more involved
[07:59.500 --> 08:00.740]  in AI and machine learning
[08:00.740 --> 08:03.940]  so that we can see and identify ways
[08:03.940 --> 08:05.860]  that we can improve our own processes,
[08:06.660 --> 08:07.880]  especially ones that require
[08:07.960 --> 08:09.060]  a little bit of fuzzy logic,
[08:09.060 --> 08:10.480]  things that traditional signatures
[08:10.480 --> 08:14.160]  or scripting aren't sufficient in.
[08:16.380 --> 08:18.600]  So the 7-step machine learning process
[08:18.600 --> 08:20.500]  generally looks like this.
[08:20.660 --> 08:22.720]  And this isn't a list
[08:22.720 --> 08:25.560]  so much as a cycle, right?
[08:26.560 --> 08:28.080]  So first we're going to start
[08:28.080 --> 08:28.980]  with gathering the data.
[08:28.980 --> 08:32.240]  We want to identify what it is
[08:32.240 --> 08:33.840]  we want our machine learning algorithm
[08:33.840 --> 08:35.660]  to classify.
[08:35.960 --> 08:37.520]  So first we're going to gather the data
[08:37.880 --> 08:38.720]  and then we're going to prepare the data
[08:38.720 --> 08:40.660]  in such a way that is easy
[08:40.660 --> 08:43.100]  for machine learning models to understand.
[08:43.300 --> 08:44.640]  And in this case, it's usually turning it
[08:44.640 --> 08:45.740]  into some form of number
[08:45.740 --> 08:47.440]  or some form of signal.
[08:48.780 --> 08:50.840]  Most of our time is spent gathering the data
[08:50.840 --> 08:51.860]  and preparing the data.
[08:51.860 --> 08:54.280]  The rest of this tends to be a little bit quicker.
[08:55.160 --> 08:56.860]  So first, based on the data that we have
[08:56.860 --> 08:58.840]  and the problem set that we're presented,
[08:58.840 --> 09:00.260]  we choose a model.
[09:00.260 --> 09:02.240]  So some models perform better than others
[09:02.240 --> 09:03.420]  in certain tasks.
[09:03.420 --> 09:04.720]  And sometimes you just got to kind of
[09:07.880 --> 09:08.780]  figure out maybe neural nets isn't
[09:08.780 --> 09:10.120]  the best model for you.
[09:10.120 --> 09:12.620]  Maybe logistic regression or SVM
[09:12.620 --> 09:15.220]  or random forest classifier is better for you.
[09:15.680 --> 09:17.140]  But once we've identified the model
[09:17.140 --> 09:19.560]  or models, then we go ahead and train it.
[09:19.560 --> 09:21.540]  And so what we'll usually do is perform
[09:21.540 --> 09:23.600]  what's called a training testing split.
[09:23.600 --> 09:25.760]  And in this case, we have a series
[09:25.760 --> 09:29.320]  or a lot of clean labeled data.
[09:29.480 --> 09:30.900]  And we take that data,
[09:30.900 --> 09:33.440]  we separate 80% of it and 20% of it.
[09:33.440 --> 09:35.880]  80% of it is what we use to train the model.
[09:35.880 --> 09:37.000]  And then the other 20%
[09:37.880 --> 09:40.000]  is what is used to evaluate the model
[09:40.000 --> 09:41.940]  to see how well the model is performing.
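A minimal sketch of the 80/20 training/testing split, using only the standard library. The helper name `split_train_test`, the fixed seed, and the toy data are assumptions made for illustration; machine learning libraries ship their own versions of this step.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle labeled samples and hold out test_fraction of them for evaluation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up labeled samples: (text, label).
samples = [(f"email {i}", "spam" if i % 2 else "ham") for i in range(10)]
train, test = split_train_test(samples)
print(len(train), len(test))
```

The model only ever sees `train` during training; `test` is kept aside so the evaluation reflects data the model has not seen.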
[09:42.840 --> 09:45.220]  And so it's kind of like how we teach
[09:45.220 --> 09:47.160]  students in elementary school, right?
[09:47.160 --> 09:49.500]  We train them on the bulk of information
[09:49.500 --> 09:51.380]  and then we test them to see how well
[09:51.380 --> 09:53.140]  they're understanding the material.
[09:54.040 --> 09:55.880]  And so some of our models
[09:56.680 --> 09:57.740]  will perform very well.
[09:57.740 --> 10:00.640]  Some of our models may not.
[10:00.640 --> 10:02.000]  And so if we have a model
[10:02.000 --> 10:03.640]  that performs fairly well,
[10:03.640 --> 10:04.800]  but we think it could do better,
[10:04.800 --> 10:06.500]  then we hop into what is called
[10:07.100 --> 10:08.260]  hyperparameter tuning.
[10:08.500 --> 10:09.900]  And we're going to see what a hyperparameter
[10:09.900 --> 10:11.600]  looks like in a little bit.
[10:11.880 --> 10:13.600]  But hyperparameter tuning,
[10:13.600 --> 10:16.180]  think of it as little knobs and levers
[10:16.180 --> 10:17.540]  and switches that we can use
[10:17.540 --> 10:18.760]  to kind of tweak our model
[10:18.760 --> 10:20.160]  and tune it in such a way
[10:20.160 --> 10:24.160]  that it is better at a certain type of problem.
[10:24.980 --> 10:26.440]  And then once we've completed all this,
[10:26.440 --> 10:27.680]  we're ready to deploy.
[10:28.000 --> 10:30.040]  Now the reason why this is cyclical, right?
[10:30.040 --> 10:31.300]  We want to gather more data,
[10:31.300 --> 10:32.900]  prepare the data,
[10:34.800 --> 10:36.000]  and use the same model
[10:36.000 --> 10:38.500]  that we're already using after deploy
[10:38.500 --> 10:40.040]  is because machine learning
[10:40.580 --> 10:42.860]  stops learning as soon as it's trained.
[10:43.300 --> 10:44.840]  So to kind of give you an idea,
[10:44.840 --> 10:46.820]  I know that certain spam filters
[10:46.820 --> 10:48.660]  like Facebook's spam filter,
[10:48.660 --> 10:50.260]  they retrain it and rebuild it
[10:50.260 --> 10:51.380]  every one to three days
[10:51.380 --> 10:53.960]  because that's how quickly spam
[10:53.960 --> 10:56.460]  is being processed through their pipelines
[10:56.460 --> 10:58.260]  and through their networks.
[10:58.320 --> 10:59.580]  And so it's a way for them
[10:59.580 --> 11:01.420]  to kind of stay on top of that.
[11:04.280 --> 11:06.040]  So we're going to hop into
[11:06.900 --> 11:08.880]  understanding how machine learning works,
[11:08.880 --> 11:10.240]  especially in the security space.
[11:10.240 --> 11:12.240]  And the easiest example, in my opinion,
[11:12.240 --> 11:14.040]  is building a spam filter.
[11:14.600 --> 11:15.520]  Now spam filters,
[11:15.520 --> 11:17.900]  we already know how they operate conceptually.
[11:18.180 --> 11:20.660]  So it's a lower barrier of entry.
[11:20.660 --> 11:22.520]  We're going to see specifically
[11:22.520 --> 11:24.580]  how the machine learning fits in.
[11:24.760 --> 11:26.120]  But before I hop into that,
[11:26.120 --> 11:27.340]  are there any questions
[11:27.340 --> 11:29.700]  on how machine learning,
[11:29.700 --> 11:31.380]  AI, and deep learning
[11:31.380 --> 11:32.300]  kind of intermingle
[11:32.300 --> 11:33.600]  and why it's important
[11:33.600 --> 11:35.400]  to understand machine learning
[11:35.400 --> 11:36.580]  from a security standpoint
[11:36.580 --> 11:39.220]  or in the hacking community?
[11:46.490 --> 11:48.330]  I see on the Twitch chat,
[11:48.330 --> 11:49.250]  reinforcement learning,
[11:49.250 --> 11:50.670]  self-learning,
[11:50.670 --> 11:52.290]  counterfactual regret minimization,
[11:52.290 --> 11:54.110]  what are the differences between them?
[11:54.410 --> 11:57.210]  I'm not familiar with
[11:57.210 --> 11:58.850]  counterfactual
[11:58.850 --> 12:00.810]  regret minimization,
[12:00.810 --> 12:04.910]  but typically you'll see
[12:06.430 --> 12:07.970]  three types of machine learning.
[12:07.970 --> 12:09.450]  You'll see classification
[12:09.450 --> 12:10.910]  or supervised learning,
[12:10.910 --> 12:12.490]  which is what we're going to be covering.
[12:12.490 --> 12:14.350]  You'll see unsupervised learning,
[12:14.350 --> 12:17.430]  which is very good for data exploration.
[12:18.870 --> 12:20.170]  And unsupervised learning
[12:20.170 --> 12:21.990]  uses a whole different set of algorithms.
[12:21.990 --> 12:23.550]  And then there's reinforcement learning,
[12:23.550 --> 12:26.870]  where we embed in a model
[12:26.870 --> 12:29.170]  different rewards and punishments
[12:29.170 --> 12:32.410]  based on how it's performing.
[12:32.690 --> 12:33.770]  And so these are all different ways
[12:33.770 --> 12:34.870]  of training the model,
[12:34.870 --> 12:37.590]  and then we can make decisions off of those.
[12:39.830 --> 12:42.490]  Hey, Sven, are there any questions
[12:42.490 --> 12:45.070]  in the Discord chat before I move on?
[12:55.840 --> 12:57.780]  Okay, I'm going to assume not.
[12:59.480 --> 13:02.620]  So before we look at a full spam filter,
[13:02.620 --> 13:04.740]  we want to kind of look at a smaller example.
[13:04.880 --> 13:06.000]  So in this case,
[13:06.000 --> 13:07.200]  we have a series of sentences
[13:07.200 --> 13:10.020]  talking about sports or not sports,
[13:10.020 --> 13:11.800]  which are really elections.
[13:12.260 --> 13:13.980]  So here we have a couple sentences.
[13:13.980 --> 13:14.920]  A great game.
[13:15.100 --> 13:16.760]  Sorry. Yes.
[13:18.260 --> 13:20.420]  Well, I think you got this one.
[13:21.340 --> 13:23.200]  But there was a question from Norwin.
[13:23.200 --> 13:24.680]  I fell asleep on the job.
[13:26.340 --> 13:27.660]  No worries.
[13:27.780 --> 13:29.280]  It has been a long weekend.
[13:30.260 --> 13:31.780]  So what was the question?
[13:34.340 --> 13:38.680]  Norwin kindly repeated Glucorio's question.
[13:40.200 --> 13:42.060]  So you already answered it.
[13:42.060 --> 13:43.320]  Okay, okay.
[13:45.000 --> 13:46.300]  So kind of going back,
[13:46.300 --> 13:47.820]  we have a great game,
[13:47.820 --> 13:49.260]  which is talking about sports.
[13:49.260 --> 13:50.800]  We have the election was over,
[13:50.800 --> 13:53.200]  which is talking about not sports.
[13:53.240 --> 13:55.680]  Very clean match, talking about sports.
[13:55.680 --> 13:57.140]  Clean but forgettable game,
[13:57.140 --> 13:58.400]  talking about sports.
[13:58.400 --> 14:00.580]  So if we wanted to take this idea
[14:01.780 --> 14:02.300]  and classify a sentence
[14:02.300 --> 14:03.640]  that we've never seen before
[14:03.640 --> 14:06.500]  about sports and not sports,
[14:06.500 --> 14:08.500]  how would we do this?
[14:08.700 --> 14:11.640]  We would want to look at keywords.
[14:11.900 --> 14:13.900]  So in this case,
[14:13.900 --> 14:16.380]  we can see that it's talking about games.
[14:16.380 --> 14:18.060]  We know it's probably about sports.
[14:18.060 --> 14:19.260]  If it's talking about elections,
[14:19.260 --> 14:21.720]  it's probably not talking about sports.
[14:22.500 --> 14:24.960]  And so what we want to really hone in on
[14:24.960 --> 14:26.640]  are keywords.
[14:26.940 --> 14:29.040]  But to put this into a machine learning model,
[14:29.040 --> 14:31.040]  we need to find a way
[14:31.040 --> 14:33.500]  to turn this into numbers.
[14:34.660 --> 14:35.720]  So the way that we're going to do this
[14:35.720 --> 14:37.920]  is Bayes' Theorem.
[14:37.920 --> 14:39.380]  Bayes' Theorem is...
[14:40.030 --> 14:41.760]  it may look complex at first,
[14:41.760 --> 14:42.860]  but it's actually a really easy
[14:42.860 --> 14:44.800]  to understand probability theorem.
[14:45.480 --> 14:46.080]  And in this case,
[14:46.080 --> 14:47.940]  we see the probability of A given B
[14:47.940 --> 14:50.260]  equals probability of B given A
[14:50.260 --> 14:51.660]  times the probability of A
[14:51.660 --> 14:53.540]  divided by the probability of B.
[14:53.540 --> 14:54.220]  So in this case,
[14:54.220 --> 14:55.580]  we want to see what is the probability
[14:56.660 --> 14:59.120]  that this sentence is talking about sports
[14:59.120 --> 15:01.020]  given that the sentence is
[15:01.020 --> 15:02.800]  A very close game.
[15:03.280 --> 15:06.220]  So we don't know the probability of sports
[15:06.220 --> 15:07.640]  given A very close game,
[15:07.640 --> 15:09.440]  but we can easily plug that into this formula
[15:09.440 --> 15:11.780]  and say, well, we know the probability of sports.
[15:11.780 --> 15:13.900]  We saw three out of our five sentences
[15:13.900 --> 15:15.120]  were talking about sports.
[15:15.120 --> 15:16.820]  And we can figure out the probability
[15:16.820 --> 15:19.060]  of A very close game given sports
[15:19.060 --> 15:21.500]  and the probability of A very close game.
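Written out, the theorem as read aloud above, followed by the same formula with the example sentence plugged in. The prior of 3/5 is the "three out of our five sentences" figure from the talk; the notation itself is just one common way of writing it.

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]
\[
P(\text{sports} \mid \text{a very close game})
  = \frac{P(\text{a very close game} \mid \text{sports})\, P(\text{sports})}
         {P(\text{a very close game})},
\qquad P(\text{sports}) = \frac{3}{5}
\]
```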
[15:22.200 --> 15:23.140]  But how do we do that?
[15:24.220 --> 15:25.540]  How do we get a sentence like this?
[15:26.000 --> 15:26.940]  Well, with Bayes theorem,
[15:26.940 --> 15:29.240]  we can actually, or with probability,
[15:29.240 --> 15:31.020]  actually, we can split this up.
[15:31.020 --> 15:32.520]  So instead of A very close game
[15:32.520 --> 15:33.620]  as a whole sentence,
[15:33.620 --> 15:35.500]  we can look at the probability of A
[15:35.500 --> 15:36.740]  times the probability of very
[15:36.740 --> 15:39.160]  times close times game.
[15:39.640 --> 15:41.540]  And so what this turns into
[15:41.540 --> 15:44.000]  in our A very close game given sports
[15:45.060 --> 15:47.220]  is probability of A in sports,
[15:47.220 --> 15:48.820]  probability of very in sports,
[15:48.820 --> 15:49.840]  and so on.
[15:49.840 --> 15:50.920]  And of course, what we're going to want to do
[15:54.220 --> 15:55.540]  is say, okay,
[15:55.540 --> 15:56.920]  whichever is highest,
[15:56.920 --> 15:58.040]  sports or not sports,
[15:58.040 --> 15:59.500]  that's the one that we're going to classify
[15:59.500 --> 16:02.140]  this sentence as being part of.
[16:03.380 --> 16:04.180]  So at this point,
[16:04.180 --> 16:06.260]  it becomes a simple counting game, right?
[16:06.260 --> 16:08.520]  So how many times does the word A
[16:08.520 --> 16:10.160]  show up in sports?
[16:10.400 --> 16:12.140]  Well, we see that it shows up twice,
[16:12.140 --> 16:13.040]  once in the first sentence
[16:13.040 --> 16:14.520]  and once in the fourth sentence.
[16:14.520 --> 16:15.940]  And then how many times does it show up
[16:15.940 --> 16:17.340]  in not sports?
[16:17.540 --> 16:19.600]  Once, right there in the fifth sentence.
[16:24.220 --> 16:25.360]  So close does not show up in sports.
[16:25.360 --> 16:26.920]  It only shows up in not sports.
[16:26.920 --> 16:28.040]  And then we see the word game
[16:28.040 --> 16:30.040]  show up twice in sports.
[16:31.740 --> 16:33.020]  So we just take those numbers
[16:33.020 --> 16:34.620]  and we plug them in and we say,
[16:34.620 --> 16:36.720]  okay, A showed up in sports twice,
[16:36.720 --> 16:37.940]  very showed up once,
[16:37.940 --> 16:39.740]  close showed up zero times,
[16:39.740 --> 16:41.180]  and game showed up twice.
[16:41.280 --> 16:43.120]  And if we counted the total number of words,
[16:43.120 --> 16:45.080]  we would have had 11 words
[16:45.080 --> 16:46.240]  across all of our sentences
[16:46.640 --> 16:48.020]  talking about sports.
[16:48.880 --> 16:50.240]  But we run into a problem
[16:54.620 --> 16:56.280]  and the problem is that
[16:56.280 --> 16:58.100]  what we want our machine learning model to do
[16:58.100 --> 17:00.560]  is identify and classify sentences
[17:00.560 --> 17:02.400]  that it has never seen before.
[17:02.960 --> 17:04.540]  Unfortunately, it's never seen close
[17:04.540 --> 17:06.060]  and so that winds up being a zero,
[17:06.060 --> 17:08.380]  zero times everything is zero.
[17:08.380 --> 17:10.260]  And we see that that easily cascades down
[17:10.260 --> 17:11.380]  and just ruins our model.
[17:11.380 --> 17:13.120]  So it's because we've never seen the word close,
[17:13.120 --> 17:14.280]  there's a zero probability
[17:14.780 --> 17:17.180]  that this exists in our sentence.
[17:18.080 --> 17:19.320]  So that's a problem,
[17:19.760 --> 17:21.480]  but we can get around that.
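The zero problem is easy to reproduce. The sketch below counts words in the three sports sentences and multiplies the per-word probabilities with no smoothing. The sentence list is reconstructed from what is read aloud (the leading "a" on the fourth sentence is inferred from the counts mentioned), and the function name is hypothetical.

```python
from collections import Counter

# Sports sentences as described in the talk (reconstructed, see note above).
SPORTS = ["a great game", "very clean match", "a clean but forgettable game"]


def unsmoothed_likelihood(sentence, class_sentences):
    """Multiply count(word) / total_words for every word in the sentence."""
    counts = Counter(w for s in class_sentences for w in s.split())
    total_words = sum(counts.values())  # 11 for the sports sentences
    prob = 1.0
    for word in sentence.split():
        prob *= counts[word] / total_words  # an unseen word contributes 0
    return prob


# "close" never appears in the sports sentences, so the whole product collapses.
result = unsmoothed_likelihood("a very close game", SPORTS)
print(result)
```

One unseen word is enough to wipe out the evidence from every other word in the sentence, which is the cascade described above.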
[17:22.700 --> 17:23.740]  There's another algorithm
[17:24.220 --> 17:26.060]  called multinomial Naive Bayes.
[17:26.180 --> 17:28.000]  And so this is just a small modification
[17:28.000 --> 17:29.700]  and what it does is add
[17:29.700 --> 17:31.140]  what is called a smoothing filter.
[17:31.140 --> 17:33.080]  And we can see this alpha here.
[17:33.080 --> 17:35.740]  It's added to the numerator and the denominator.
[17:35.800 --> 17:37.460]  And usually this alpha is one,
[17:37.460 --> 17:39.420]  but this is going to be our hyperparameter
[17:39.420 --> 17:41.600]  for Naive Bayes,
[17:41.600 --> 17:42.820]  multinomial Naive Bayes
[17:42.820 --> 17:44.540]  that we can tweak, right?
[17:44.540 --> 17:46.360]  We can set it to one, we can set it to two,
[17:46.360 --> 17:47.900]  we can set it to 0.1.
[17:48.040 --> 17:50.840]  And so this is a thing that we can tweak
[17:50.840 --> 17:52.280]  to see if we can get our model
[17:54.220 --> 17:54.660]  to work for us.
[17:55.440 --> 17:56.780]  So we take the same idea.
[17:56.780 --> 17:57.680]  We do the word counts.
[17:57.680 --> 17:59.700]  We saw that a showed up in sports twice,
[17:59.700 --> 18:01.360]  very showed up once, close zero.
[18:01.360 --> 18:03.180]  But now we have this plus one.
[18:03.580 --> 18:06.620]  So the probability of close given sports
[18:06.620 --> 18:08.480]  is never going to reach zero.
[18:09.180 --> 18:10.760]  And then of course, as I mentioned earlier,
[18:10.760 --> 18:13.060]  we need to do the same thing for not sports.
[18:13.060 --> 18:14.820]  And the total number of words
[18:14.820 --> 18:16.280]  in not sports is nine.
[18:16.280 --> 18:18.600]  The total number of words in sports is 11.
[18:18.600 --> 18:20.060]  And the total number of unique words
[18:20.060 --> 18:23.320]  across both classes is 14.
[18:26.080 --> 18:28.080]  So we plug in our answers
[18:28.080 --> 18:29.360]  and we do our calculations
[18:29.360 --> 18:33.680]  and we get 0.0000461 for sports
[18:33.680 --> 18:37.880]  and 0.0000143 for non-sports.
[18:38.200 --> 18:39.440]  Classifying the sentence,
[18:39.560 --> 18:42.800]  a very close game as being about sports.
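A minimal sketch that reproduces the two numbers above with add-alpha smoothing. Two things here are assumptions: the fifth training sentence is never read aloud in full, so "it was a close election" is used because it matches the counts quoted (9 words in not sports, 14 unique words overall), and the function name `smoothed_likelihood` is hypothetical.

```python
from collections import Counter

TRAINING = {
    "sports": ["a great game", "very clean match", "a clean but forgettable game"],
    # Second sentence below is an assumed stand-in for the fifth example.
    "not sports": ["the election was over", "it was a close election"],
}


def smoothed_likelihood(sentence, label, alpha=1.0):
    """Product of (count + alpha) / (class_words + alpha * vocab_size) per word."""
    counts = Counter(w for s in TRAINING[label] for w in s.split())
    class_words = sum(counts.values())  # 11 for sports, 9 for not sports
    vocab = {w for sents in TRAINING.values() for s in sents for w in s.split()}
    prob = 1.0
    for word in sentence.split():
        prob *= (counts[word] + alpha) / (class_words + alpha * len(vocab))
    return prob


sports_score = smoothed_likelihood("a very close game", "sports")
not_sports_score = smoothed_likelihood("a very close game", "not sports")
print(f"{sports_score:.3g}", f"{not_sports_score:.3g}")  # 4.61e-05 1.43e-05
```

These scores are the word-likelihood products only. Multiplying in the class priors (3/5 for sports, 2/5 for not sports) scales both numbers but leaves sports on top. Changing `alpha` is the hyperparameter tuning mentioned earlier.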
[18:45.480 --> 18:48.320]  So that was a lot that I just kind of laid on you.
[18:49.080 --> 18:50.400]  If you have any questions,
[18:50.400 --> 18:51.660]  type them in the chat.
[18:52.020 --> 18:53.200]  While you're doing that,
[18:54.220 --> 18:55.580]  I'm going to show you how we can
[18:56.720 --> 18:58.380]  take our raw emails
[18:59.100 --> 19:00.940]  and transform them into
[19:00.940 --> 19:02.480]  these types of counts
[19:02.480 --> 19:03.440]  and these types of numbers
[19:04.120 --> 19:06.500]  in a very clean and effective way.
[19:06.960 --> 19:09.180]  So the five things we need to keep track of
[19:09.180 --> 19:11.920]  are the total number of unique words,
[19:11.920 --> 19:14.240]  the total number of words in spam,
[19:14.240 --> 19:16.420]  the total number of words in ham,
[19:16.420 --> 19:18.080]  or regular emails,
[19:18.080 --> 19:20.420]  the count of each word in spam,
[19:20.420 --> 19:22.700]  and the count of each word in ham.
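The five quantities listed above can be tallied in one pass over labeled emails. This is a sketch under assumptions: the function name `tally`, the dictionary keys, and the three tiny "emails" are made up for illustration, and the text is assumed to be already preprocessed into space-separated keywords.

```python
from collections import Counter


def tally(labeled_emails):
    """Collect the five counts needed for the spam/ham word probabilities."""
    spam_counts, ham_counts = Counter(), Counter()
    for label, text in labeled_emails:
        target = spam_counts if label == "spam" else ham_counts
        target.update(text.split())
    return {
        "unique_words": len(set(spam_counts) | set(ham_counts)),
        "total_spam_words": sum(spam_counts.values()),
        "total_ham_words": sum(ham_counts.values()),
        "spam_counts": spam_counts,  # count of each word in spam
        "ham_counts": ham_counts,    # count of each word in ham
    }


emails = [
    ("spam", "win free prize now"),
    ("ham", "thank support install unifont"),
    ("spam", "free prize"),
]
stats = tally(emails)
print(stats["unique_words"], stats["total_spam_words"], stats["total_ham_words"])
```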
[19:25.960 --> 19:28.000]  So let's take a look at one of our emails.
[19:28.720 --> 19:29.500]  We see here,
[19:29.500 --> 19:31.620]  East Asian fonts in Lenny.
[19:31.620 --> 19:32.640]  Thanks for your support.
[19:32.640 --> 19:34.760]  Installing Unifonts did it well for me.
[19:35.340 --> 19:37.140]  Now we want to take a sentence like this
[19:37.140 --> 19:39.920]  and find a way to classify it
[19:39.920 --> 19:42.240]  as being spam or ham.
[19:42.700 --> 19:43.860]  But before we do that,
[19:43.860 --> 19:45.840]  we want to kind of point out a couple things.
[19:45.840 --> 19:47.500]  So we see here kind of towards the bottom,
[19:47.500 --> 19:49.340]  we have the word unsubscribe,
[19:49.340 --> 19:50.400]  all capitalized,
[19:50.400 --> 19:52.700]  and then unsubscribe lowercase.
[19:52.900 --> 19:55.460]  What we want to do is look at keywords.
[19:55.500 --> 19:57.540]  And so in context,
[19:57.540 --> 20:01.620]  it doesn't matter if unsubscribe is capitalized
[20:01.620 --> 20:02.860]  or the first letter is capitalized
[20:02.860 --> 20:04.180]  or it's all lowercase.
[20:04.280 --> 20:06.300]  It means the same thing.
[20:06.320 --> 20:10.900]  And so to kind of reduce our problem set,
[20:10.900 --> 20:13.420]  what we're going to do is set everything to lowercase.
[20:13.900 --> 20:15.140]  The other thing we want to do
[20:15.140 --> 20:17.180]  is remove what are called stop words.
[20:17.180 --> 20:18.700]  Stop words are words like
[20:18.700 --> 20:21.700]  the, in, of, on, or...
[20:21.700 --> 20:24.420]  They don't add any context to the sentence.
[20:24.780 --> 20:27.100]  And we're going to do the same thing for punctuation.
[20:27.200 --> 20:28.580]  So we're going to remove those.
[20:28.580 --> 20:30.300]  We're going to keep our keywords.
[20:30.300 --> 20:31.800]  And then one more thing that we're going to do
[20:31.800 --> 20:33.300]  is called stemming.
[20:33.580 --> 20:36.060]  So if we have something like the word
[20:36.820 --> 20:38.700]  congrats or congratulations,
[20:39.540 --> 20:40.540]  in the English language,
[20:40.540 --> 20:41.620]  they mean the same thing.
[20:41.620 --> 20:44.200]  So if we're looking at keywords,
[20:44.200 --> 20:45.820]  we want to treat them the same.
[20:45.820 --> 20:47.000]  And so what stemming does
[20:47.000 --> 20:49.600]  is shrink those words down to their base form
[20:49.600 --> 20:50.960]  or the root form.
[20:50.960 --> 20:52.860]  And then we can count them the same.
[20:52.940 --> 20:56.120]  Another really good example is the word thanks,
[20:56.120 --> 20:57.240]  thank you,
[20:57.240 --> 20:59.840]  where it's just a singular thank.
[20:59.840 --> 21:01.760]  And I thanked somebody today
[21:01.760 --> 21:03.380]  with an ed at the end.
[21:04.040 --> 21:05.940]  As a keyword, they all mean the same thing.
[21:05.940 --> 21:08.360]  And so stemming will reduce them all down to thank.
[21:08.540 --> 21:10.340]  And so instead of three separate words,
[21:10.340 --> 21:11.760]  it's going to be counted as one word.
[21:11.760 --> 21:13.500]  And so that's going to drastically reduce
[21:13.500 --> 21:17.200]  our dimension space.
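To make the stemming idea concrete, here is a deliberately crude suffix stripper. It is not the stemmer used in the workshop (that comes from NLTK) and it will not handle cases like congrats versus congratulations; `toy_stem` is a hypothetical name and the rule set is an assumption chosen to keep the sketch short.

```python
def toy_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["thanks", "thanked", "thank", "installing", "installed", "install"]
stems = [toy_stem(w) for w in words]
print(stems)
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")
```

Six surface forms collapse to two stems here, which is the shrinking of the feature space that the talk describes.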
[21:18.300 --> 21:20.560]  So after our preprocessing,
[21:20.560 --> 21:22.000]  which is what this is known as,
[21:22.000 --> 21:24.520]  we're going to see East Asian fonts in Lenny,
[21:24.520 --> 21:26.080]  or East Asian fonts Lenny,
[21:26.080 --> 21:29.240]  thanks support installing Unifonts,
[21:29.240 --> 21:30.080]  well me.
[21:31.400 --> 21:34.820]  And then what isn't shown here
[21:34.820 --> 21:36.060]  is what the stemming looks like.
[21:36.060 --> 21:37.700]  So as we go through the workbook,
[21:38.340 --> 21:40.040]  we'll take this exact same email
[21:40.040 --> 21:41.880]  and see how it's transformed
[21:41.880 --> 21:45.840]  at every step of our preprocessing.
[21:46.480 --> 21:49.260]  So are there any questions so far?
[21:55.420 --> 21:56.680]  I don't think so.
[21:56.680 --> 21:57.840]  There's none on Twitch chat,
[21:57.840 --> 21:59.240]  and there's none in the Discord.
[21:59.280 --> 22:00.320]  Perfect.
[22:02.320 --> 22:03.940]  So what we're going to have
[22:03.940 --> 22:05.860]  are a series of workbooks.
[22:05.860 --> 22:07.820]  We're going to use MyBinder.
[22:07.820 --> 22:09.740]  MyBinder is really awesome.
[22:10.700 --> 22:13.620]  TAs, if you don't mind sharing this link.
[22:13.880 --> 22:16.040]  What it will do is spin up a Docker container
[22:16.800 --> 22:18.860]  with the workbook linked to my GitHub
[22:18.860 --> 22:22.420]  so that you can kind of code with me
[22:22.420 --> 22:23.400]  as we walk through
[22:23.400 --> 22:26.740]  and build our own machine learning classifiers.
[22:27.980 --> 22:29.260]  While this is starting up,
[22:29.260 --> 22:30.460]  what we're going to be using is
[22:30.460 --> 22:32.520]  instead of coding everything raw
[22:32.520 --> 22:33.920]  and doing the raw numbers like we did
[22:33.920 --> 22:36.820]  on our pen and paper,
[22:36.820 --> 22:37.920]  we're going to be using
[22:37.920 --> 22:39.600]  the abstraction library Scikit-learn.
[22:39.600 --> 22:40.860]  Scikit-learn is very useful
[22:40.860 --> 22:43.520]  and it allows us to treat things
[22:43.520 --> 22:44.680]  kind of like Legos,
[22:44.680 --> 22:46.800]  where, OK, maybe I don't want
[22:46.800 --> 22:48.140]  multinomial naive Bayes,
[22:48.140 --> 22:49.300]  maybe I want logistic regression
[22:49.300 --> 22:50.700]  or random forest.
[22:50.700 --> 22:52.680]  We can just swap that out.
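The "Legos" point can be illustrated with a short sketch: scikit-learn classifiers share the same `fit` and `predict` methods, so one loop can try several of them. The feature matrix and labels below are invented for the example, and the `models` dictionary is just one way to organise the comparison, not the workbook's code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count features: columns are counts of ["free", "thanks"].
X = [[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 4]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Every estimator exposes the same fit/predict interface, so they are interchangeable.
models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    # Classify a new email with five occurrences of "free" and none of "thanks".
    predictions[name] = model.predict([[5, 0]])[0]
    print(name, predictions[name])
```

Swapping a model in or out only means changing one entry in the dictionary; the training and prediction code stays the same.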
[22:54.780 --> 22:56.380]  So once it's loaded,
[22:56.380 --> 22:58.780]  and I'm going to give you a moment
[22:58.780 --> 23:00.780]  to get this loaded up.
[23:01.380 --> 23:03.120]  If it takes a little bit longer
[23:03.120 --> 23:05.280]  than it should,
[23:05.280 --> 23:07.100]  then you'll easily be able
[23:07.100 --> 23:07.980]  to catch up to me
[23:07.980 --> 23:09.600]  and I'll show you how.
[23:10.160 --> 23:11.760]  So we're going to click on workbooks
[23:13.540 --> 23:15.860]  spam filter sklearn.
[23:15.860 --> 23:17.840]  If you're not very familiar
[23:17.840 --> 23:19.940]  with Python, you can still follow along.
[23:19.940 --> 23:21.520]  We have a completed workbook,
[23:21.520 --> 23:22.660]  which is the workbook complete
[23:22.660 --> 23:24.780]  spam filter sklearn.
[23:25.720 --> 23:26.800]  And these are Jupyter notebooks.
[23:26.800 --> 23:28.120]  If you've never used Jupyter notebooks,
[23:28.120 --> 23:29.260]  I think they're the best thing
[23:29.260 --> 23:30.480]  since sliced bread,
[23:30.480 --> 23:32.380]  especially when it comes to data science.
[23:32.820 --> 23:35.260]  But it allows us to interact
[23:35.260 --> 23:36.200]  with code blocks
[23:36.200 --> 23:38.540]  and visualize things in line.
[23:38.540 --> 23:39.740]  And I think they work fantastic
[23:43.520 --> 23:44.040]  together.
[23:45.080 --> 23:47.800]  So let's go ahead, scroll down.
[23:47.800 --> 23:50.560]  We're going to run this first code block.
[23:51.100 --> 23:52.380]  And so what this is going to do
[23:52.380 --> 23:53.880]  is install a couple of libraries
[23:53.880 --> 23:54.840]  that we need,
[23:54.840 --> 23:57.020]  as well as set our data directory
[23:57.020 --> 23:59.320]  to where the data directory is
[23:59.320 --> 24:00.960]  in the GitHub project.
[24:00.960 --> 24:03.400]  So to run these, you can click run
[24:03.400 --> 24:04.360]  up at the top,
[24:04.360 --> 24:06.560]  or you can press shift enter,
[24:06.560 --> 24:07.900]  which is what I'm pressing now.
[24:07.900 --> 24:10.560]  And then you can see that it's running.
[24:10.560 --> 24:12.160]  A star in the upper corner
[24:12.160 --> 24:13.780]  means that it's currently running.
[24:13.780 --> 24:15.320]  And then a number in the upper corner
[24:15.320 --> 24:17.560]  means that that was the order
[24:17.560 --> 24:19.840]  code block was run in.
[24:22.560 --> 24:25.040]  So while that is going,
[24:25.040 --> 24:25.820]  I'm going to
[24:26.780 --> 24:28.220]  let it go until it stops
[24:28.220 --> 24:29.880]  scrolling my thing.
[24:30.900 --> 24:32.140]  So we're just going to import
[24:32.320 --> 24:33.180]  a couple libraries.
[24:33.180 --> 24:34.800]  We're going to import numpy,
[24:34.800 --> 24:36.820]  matplotlib for graphs,
[24:36.820 --> 24:38.280]  regular expressions we'll be using
[24:40.560 --> 24:41.360]  and NLTK.
[24:41.360 --> 24:42.540]  NLTK stands for
[24:42.540 --> 24:44.380]  Natural Language Toolkit.
[24:44.380 --> 24:46.400]  This will allow us to identify
[24:46.400 --> 24:49.100]  and collect a list of stop words
[24:49.100 --> 24:51.560]  to remove those things.
[24:51.560 --> 24:53.260]  And it will also allow us
[24:53.260 --> 24:54.900]  to use a stemmer function
[24:54.900 --> 24:57.060]  to stem the words
[24:57.060 --> 24:59.780]  and reduce our problem space.
[25:00.800 --> 25:01.940]  Train test split,
[25:01.940 --> 25:04.320]  I don't believe we're actually using that in here.
[25:04.320 --> 25:06.320]  The data directories I have
[25:06.320 --> 25:07.740]  are already split into
[25:07.740 --> 25:09.260]  80% training,
[25:09.260 --> 25:11.060]  20% testing.
[25:11.300 --> 25:13.320]  And then we're going to use our vectorizers.
[25:13.320 --> 25:15.460]  And we'll explain what the vectorizers do,
[25:15.460 --> 25:16.940]  but we're going to take a look at
[25:16.940 --> 25:19.060]  both of the TF-IDF vectorizer
[25:19.060 --> 25:20.320]  and the count vectorizer
[25:20.320 --> 25:23.160]  because they operate
[25:23.160 --> 25:24.140]  slightly different.
[25:24.140 --> 25:26.580]  Count vectorizer just takes a straight word count
[25:26.580 --> 25:29.300]  like we did with our pen and paper example.
[25:29.300 --> 25:30.760]  And then TF-IDF actually takes
[25:30.760 --> 25:32.040]  all of the words into account
[25:32.040 --> 25:34.680]  and gives them a weight.
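To show the difference between a straight count and a TF-IDF weight, here is a sketch using the textbook formula, count times log(N / document frequency). This is an assumption made for illustration: scikit-learn's TF-IDF vectorizer uses a smoothed, normalized variant, so its numbers will differ. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter

# Three toy tokenized emails (hypothetical data).
docs = [
    ["free", "prize", "unsubscribe"],
    ["thanks", "support", "unsubscribe"],
    ["free", "free", "offer", "unsubscribe"],
]

# Document frequency: in how many emails does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def count_vector(doc):
    """Raw word counts, as in the pen-and-paper example."""
    return dict(Counter(doc))

def tfidf_vector(doc):
    """Textbook TF-IDF: count * log(N / document frequency)."""
    n_docs = len(docs)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in Counter(doc).items()}

print(count_vector(docs[2]))
print(tfidf_vector(docs[2]))
```

A word such as "unsubscribe" that appears in every email gets a weight of zero under this formula, while a rarer word such as "offer" is weighted up, even though both have a raw count of one.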
[25:35.800 --> 25:36.600]  Scikit-learn,
[25:36.600 --> 25:39.480]  we're going to use logistic regression
[25:39.480 --> 25:41.060]  and multinomial naive Bayes.
[25:41.060 --> 25:42.460]  And the reason why we're going to use both of these
[25:42.460 --> 25:44.040]  is because we don't actually know
[25:44.040 --> 25:46.140]  which one's going to perform best.
[25:46.180 --> 25:47.420]  And so like good data scientists,
[25:47.420 --> 25:49.140]  we're going to try a couple different models
[25:49.140 --> 25:51.160]  and take the one that performs best
[25:51.160 --> 25:52.960]  with the problem set that we have.
[25:53.560 --> 25:55.780]  And then for a couple of metrics
[25:55.780 --> 25:57.140]  and visualizations,
[25:57.140 --> 25:58.800]  we're going to use the Scikit-learn
[25:58.800 --> 26:01.640]  confusion matrix and classification report.
[26:03.040 --> 26:04.060]  So I'm going to go ahead
[26:04.060 --> 26:05.200]  and press Shift-Enter
[26:05.200 --> 26:06.240]  and that's going to run.
[26:06.240 --> 26:08.360]  You will see libraries imported
[26:08.360 --> 26:09.960]  once that's complete.
[26:11.520 --> 26:13.300]  And we can move on to the next thing.
[26:13.620 --> 26:14.700]  So this is our test email.
[26:14.700 --> 26:15.840]  This is the exact same email
[26:15.840 --> 26:17.760]  that we saw in the slides.
[26:17.980 --> 26:19.440]  So as I mentioned earlier,
[26:19.440 --> 26:20.500]  we're going to use this email
[26:20.500 --> 26:22.880]  to see how it's transformed
[26:22.880 --> 26:25.040]  at every step of preprocessing.
[26:25.040 --> 26:26.540]  So I'm going to go ahead and run that.
[26:27.980 --> 26:30.000]  And then this is our tokenizer function.
[26:30.000 --> 26:32.260]  Tokenizer functions will allow us
[26:32.260 --> 26:33.500]  to pull out the keywords
[26:33.500 --> 26:35.620]  in the way that we would like them
[26:35.620 --> 26:37.060]  to be extracted.
[26:37.180 --> 26:38.760]  So in this case, we're going to grab
[26:38.760 --> 26:41.080]  all the punctuation, all the stop words,
[26:41.080 --> 26:42.480]  and we're going to create a
[26:42.480 --> 26:45.020]  stemmer constructor.
[26:45.020 --> 26:47.640]  We're going to use these to
[26:47.640 --> 26:49.260]  remove the stop words from the email.
[26:49.260 --> 26:51.280]  We're going to use these to remove
[26:51.280 --> 26:52.380]  punctuation from the email,
[26:52.380 --> 26:54.120]  and then we're going to stem the words.
[26:56.540 --> 26:56.940]  We're going to move all the words
[26:56.940 --> 26:58.280]  to lowercase so that they are
[26:58.280 --> 27:00.100]  treated exactly the same.
[27:00.980 --> 27:02.100]  So we're just going to go ahead
[27:02.100 --> 27:03.900]  and run this, shift enter,
[27:03.900 --> 27:05.400]  and our tokenizer is defined.
[27:05.960 --> 27:07.180]  And this is where we run into
[27:07.180 --> 27:09.100]  our first task, which is to
[27:09.100 --> 27:10.420]  tokenize an email.
[27:10.780 --> 27:12.260]  So our task says that we want
[27:12.260 --> 27:14.420]  to print the full email, test email,
[27:14.420 --> 27:15.240]  and then we're going to print the
[27:15.240 --> 27:17.680]  results of the tokenized version
[27:17.680 --> 27:18.920]  of that email.
[27:19.580 --> 27:21.140]  So here I'm just going to do
[27:27.880 --> 27:31.040]  print tokenizer test email.
[27:31.520 --> 27:32.820]  And we can go ahead and run that,
[27:32.820 --> 27:35.000]  shift enter. We can see the
[27:35.360 --> 27:36.880]  original email that we have,
[27:36.880 --> 27:38.160]  East Asian fonts in Lenny,
[27:38.160 --> 27:39.660]  thanks for your support.
[27:39.740 --> 27:41.160]  We can see, just like in our
[27:41.160 --> 27:43.360]  slides, we have East Asian Font
[27:43.360 --> 27:45.820]  Lenny, slightly changed,
[27:45.820 --> 27:48.400]  thank instead of thanks,
[27:48.400 --> 27:50.500]  support install, because we can
[27:50.500 --> 27:53.900]  have installing installed with
[27:53.900 --> 27:56.320]  an ed at the end, or I
[27:56.320 --> 27:58.220]  will install this software,
[27:58.220 --> 28:00.140]  which is just a standard install.
[28:01.340 --> 28:03.080]  We see unsubscribe has been
[28:03.080 --> 28:04.560]  changed. And two of the things
[28:04.560 --> 28:05.480]  that I want to point out is
[28:05.480 --> 28:06.740]  every single word in here is
[28:06.740 --> 28:07.860]  unique, except for the word
[28:07.860 --> 28:09.320]  unsubscribe, which shows up
[28:09.320 --> 28:12.060]  twice, and list.debian.org,
[28:12.060 --> 28:13.940]  which shows up twice.
[28:16.360 --> 28:17.400]  So we're going to take a look
[28:17.400 --> 28:18.680]  at our training data. What this
[28:18.680 --> 28:21.500]  is going to do is identify the
[28:22.140 --> 28:24.980]  directories where our spam
[28:24.980 --> 28:27.100]  files are, our ham files are,
[28:27.100 --> 28:29.160]  and our test files are. We're
[28:29.160 --> 28:30.180]  going to get a count, because
[28:30.180 --> 28:31.420]  that's one of the five things
[28:31.420 --> 28:32.400]  that we needed to keep track
[28:32.400 --> 28:35.500]  of, was the count of, actually
[28:36.800 --> 28:37.800]  we want to keep track of the
[28:37.800 --> 28:38.740]  count of words, we don't care
[28:38.740 --> 28:40.700]  how many emails we have. So
[28:40.700 --> 28:41.680]  I'm going to go ahead and run
[28:41.680 --> 28:45.180]  this, and what we're going to
[28:45.180 --> 28:47.720]  do is put those emails into
[28:47.720 --> 28:48.900]  what is known as a corpus.
[28:48.900 --> 28:49.640]  So we're going to create two
[28:49.640 --> 28:51.660]  list arrays, so corpus and
[28:51.660 --> 28:52.840]  labels. Corpus is going to
[28:52.840 --> 28:54.820]  hold all of our emails for
[28:54.820 --> 28:58.440]  spam and ham, and then the
[28:58.440 --> 28:59.800]  labels is going to be each
[28:59.800 --> 29:01.900]  label for our data. So the
[29:01.900 --> 29:02.700]  first thing we're going to do
[29:02.700 --> 29:06.020]  is corpus equals a new list,
[29:06.020 --> 29:08.000]  labels equals a new list,
[29:08.720 --> 29:09.500]  and then we're going to do
[29:09.500 --> 29:10.560]  the same thing for spam and
[29:10.560 --> 29:11.540]  ham. We're going to load the
[29:11.540 --> 29:13.980]  email bodies from the ham
[29:13.980 --> 29:14.940]  directories and the spam
[29:14.940 --> 29:16.520]  directory into the corpus
[29:16.520 --> 29:17.120]  array, and then we're going
[29:17.120 --> 29:18.260]  to load the labels for each
[29:18.260 --> 29:20.380]  email into the labels array.
[29:20.980 --> 29:21.720]  So what that's going to
[29:21.720 --> 29:22.300]  look like is we're going to
[29:22.300 --> 29:23.280]  use a for loop here, so
[29:23.280 --> 29:25.200]  we can say for each in
[29:25.200 --> 29:27.780]  os.listdir, and then we
[29:27.780 --> 29:29.340]  have our data directory,
[29:29.340 --> 29:31.400]  which is defined above.
[29:31.900 --> 29:33.260]  We're going to look at our
[29:33.820 --> 29:35.260]  ham directory for this
[29:35.260 --> 29:37.480]  first round. With each
[29:37.480 --> 29:38.800]  open file, we're going to
[29:38.800 --> 29:41.700]  look at our data directory
[29:41.700 --> 29:44.920]  plus ham plus os.listdir.
[29:44.940 --> 29:45.320]  We're going to look at
[29:45.320 --> 29:46.860]  each, because each is the
[29:46.860 --> 29:48.280]  name of our file, and
[29:48.280 --> 29:48.840]  we're going to want to
[29:48.840 --> 29:50.840]  read those as the file
[29:50.840 --> 29:52.980]  descriptor f. So for
[29:52.980 --> 29:55.800]  our corpus, we can do
[29:55.800 --> 29:59.320]  an append and then f.read.
[29:59.320 --> 30:00.380]  So this will read each
[30:00.380 --> 30:01.500]  email into our corpus
[30:01.500 --> 30:03.420]  and then labels.append
[30:04.080 --> 30:06.560]  ham, because we know
[30:06.560 --> 30:08.380]  that all of the emails
[30:08.380 --> 30:09.440]  in the ham directory are
[30:09.440 --> 30:12.380]  part of the ham data set.
[30:12.460 --> 30:13.660]  And so we can easily do
[30:13.660 --> 30:14.920]  the same thing for spam
[30:14.940 --> 30:15.940]  which is we need to be
[30:15.940 --> 30:17.300]  careful on how we change
[30:17.300 --> 30:18.320]  this. So instead of
[30:18.320 --> 30:19.280]  slash ham, it's going to
[30:19.280 --> 30:21.440]  be slash spam, and then
[30:21.440 --> 30:22.640]  of course our label is
[30:22.640 --> 30:24.140]  also going to be spam.
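The loading loop just described can be sketched as below. The `load_corpus` helper, the file names, and the temporary directory are assumptions added so the example runs without the workshop data; the workbook itself writes the ham and spam loops out separately against its own data directory.

```python
import os
import tempfile

def load_corpus(data_dir, classes=("ham", "spam")):
    """Read every file under data_dir/<class>/ into parallel corpus and labels lists."""
    corpus, labels = [], []
    for label in classes:
        class_dir = os.path.join(data_dir, label)
        for name in sorted(os.listdir(class_dir)):
            # errors="ignore" skips bytes that are not valid UTF-8 (an assumption about the data).
            with open(os.path.join(class_dir, name), encoding="utf-8", errors="ignore") as f:
                corpus.append(f.read())
            # The directory name doubles as the label for every email inside it.
            labels.append(label)
    return corpus, labels

# Build a throwaway directory so the sketch runs without the workshop data.
with tempfile.TemporaryDirectory() as data_dir:
    for label, body in [("ham", "Thanks for your support."), ("spam", "Win a FREE prize!")]:
        os.makedirs(os.path.join(data_dir, label))
        with open(os.path.join(data_dir, label, "0001.txt"), "w", encoding="utf-8") as f:
            f.write(body)
    corpus, labels = load_corpus(data_dir)

print(labels)
print(corpus)
```

Using `os.path.join` instead of concatenating strings with slashes avoids the kind of path typo the talk warns about when switching from ham to spam.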
[30:26.060 --> 30:27.260]  So I'm going to let this
[30:27.260 --> 30:28.100]  sit for a minute and let
[30:28.100 --> 30:29.760]  you guys go ahead and
[30:29.760 --> 30:30.500]  make sure that you copy
[30:30.500 --> 30:32.000]  this down correctly.
[30:33.520 --> 30:35.520]  This bit is pretty easy
[30:35.520 --> 30:36.220]  if you're familiar with
[30:36.220 --> 30:37.600]  Python. It might be a
[30:37.600 --> 30:38.800]  little confusing if you
[30:38.800 --> 30:40.780]  are not, but that's okay.
[30:40.780 --> 30:42.140]  If you ever get lost at
[30:42.140 --> 30:43.800]  any point, just remember
[30:43.800 --> 30:44.700]  that there is a
[30:44.700 --> 30:46.060]  completed solutions manual
[30:46.060 --> 30:47.990]  in the workbooks directory.
[30:50.580 --> 30:52.840]  So I'm going to go ahead
[30:52.840 --> 30:55.340]  and run this. So shift
[30:55.340 --> 30:56.930]  enter to run.
[30:57.740 --> 30:59.740]  And I spelled labels
[30:59.740 --> 31:04.160]  wrong. Labels, shift
[31:04.160 --> 31:05.980]  enter to run. Loading
[31:05.980 --> 31:07.810]  ham, loading spam, done.
[31:08.940 --> 31:10.040]  So what this is going to
[31:10.040 --> 31:12.600]  do is graphically load
[31:12.600 --> 31:14.680]  or graphically visualize
[31:14.680 --> 31:16.260]  how many ham emails we
[31:16.260 --> 31:17.140]  have versus how many
[31:17.140 --> 31:18.240]  spam emails we have.
[31:18.240 --> 31:18.820]  And this is something we
[31:18.820 --> 31:19.720]  want to generally do with
[31:19.720 --> 31:21.000]  our data just to see if
[31:21.000 --> 31:21.680]  there's any sort of
[31:21.680 --> 31:23.280]  imbalanced data sets.
[31:23.380 --> 31:24.440]  This one we have about
[31:24.440 --> 31:28.600]  50% spam per ham.
[31:29.280 --> 31:30.320]  It's like 2 to 1
[31:30.320 --> 31:32.560]  ratio of ham to spam.
[31:32.560 --> 31:34.580]  And so this I'm not
[31:34.580 --> 31:35.740]  going to consider an
[31:35.740 --> 31:36.880]  imbalanced data set for
[31:36.880 --> 31:38.980]  this example, but if we
[31:38.980 --> 31:40.000]  have something such as
[31:40.000 --> 31:41.080]  like credit card fraud
[31:41.080 --> 31:41.940]  where it happens less
[31:41.940 --> 31:43.220]  than a fraction of a
[31:43.220 --> 31:44.080]  percent of the time,
[31:44.080 --> 31:44.800]  that's something that we
[31:44.800 --> 31:45.900]  really want to dive into
[31:45.900 --> 31:47.500]  simply because if we try
[31:47.500 --> 31:49.440]  and use this exact
[31:49.440 --> 31:51.240]  example on credit card
[31:51.240 --> 31:52.060]  fraud, it's not going to
[31:52.060 --> 31:53.040]  work because what we
[31:53.040 --> 31:54.700]  will do is create a
[31:54.700 --> 31:56.980]  model that is 99%
[31:57.540 --> 31:58.980]  accurate because it just
[31:58.980 --> 32:00.460]  classifies all credit
[32:00.460 --> 32:01.540]  card transactions as
[32:01.820 --> 32:02.400]  legitimate. Now if we
[32:02.400 --> 32:03.480]  want to identify credit
[32:03.480 --> 32:04.340]  card fraud, that's not
[32:04.340 --> 32:05.700]  very helpful for us.
[32:05.700 --> 32:06.920]  So this just allows us
[32:06.920 --> 32:08.260]  to kind of explore the
[32:08.260 --> 32:10.100]  data and see how much
[32:10.100 --> 32:11.400]  of what we have.
[32:11.400 --> 32:12.260]  So in this case we have
[32:12.260 --> 32:15.000]  2,359 ham emails and
[32:15.000 --> 32:17.780]  about 1,100 spam emails.
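The credit card point can be checked with a few lines of arithmetic. The numbers below are hypothetical (10 fraudulent transactions in 10,000), chosen only to show why accuracy alone is misleading on a heavily imbalanced data set.

```python
from collections import Counter

# Hypothetical labels: 10,000 transactions, 10 of them fraudulent (0.1%).
labels = ["fraud"] * 10 + ["legitimate"] * 9990

class_counts = Counter(labels)
print(class_counts)

# A "model" that ignores its input and always predicts the majority class.
predictions = ["legitimate"] * len(labels)

correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the fraud class: what fraction of real fraud did we catch?
caught = sum(p == t == "fraud" for p, t in zip(predictions, labels))
fraud_recall = caught / class_counts["fraud"]

print(f"accuracy={accuracy:.3f} fraud_recall={fraud_recall:.3f}")
```

The do-nothing model scores very high on accuracy while catching no fraud at all, which is why checking the class balance first, and looking at per-class metrics later, matters.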
[32:19.700 --> 32:21.560]  So task 2A, we're going
[32:21.560 --> 32:22.500]  to view our training
[32:22.500 --> 32:24.520]  data. So this one's just
[32:24.520 --> 32:25.640]  to kind of see what our
[32:25.640 --> 32:28.260]  data looks like. We
[32:28.260 --> 32:29.380]  generally have an idea,
[32:29.380 --> 32:30.140]  but we just kind of want
[32:30.140 --> 32:30.880]  to play around with it.
[32:30.880 --> 32:31.620]  And this is another thing
[32:31.620 --> 32:32.440]  that you would want to
[32:32.440 --> 32:33.800]  do in a data discovery
[32:33.800 --> 32:34.980]  process is just kind of
[32:34.980 --> 32:35.840]  see, okay, what is it
[32:35.840 --> 32:36.300]  that we're actually
[32:36.300 --> 32:38.340]  working with? So in this
[32:38.340 --> 32:41.220]  case, I'm going to pick
[32:41.220 --> 32:43.200]  an email ID, just email
[32:43.200 --> 32:45.180]  ID 5. You can pick 5 or
[32:45.180 --> 32:46.640]  you can pick any other.
[32:47.520 --> 32:49.000]  And then I'm going to
[32:49.000 --> 32:50.560]  grab the email, which
[32:50.560 --> 32:51.580]  is part of the corpus
[32:51.580 --> 32:54.160]  array. Corpus and then
[32:54.160 --> 32:57.340]  our email ID. And then
[32:57.340 --> 32:59.740]  I'm going to print that
[32:59.740 --> 33:01.560]  email, and then I'm
[33:01.560 --> 33:02.400]  going to do the same
[33:02.400 --> 33:04.520]  thing and print the
[33:05.460 --> 33:07.300]  tokenized version using
[33:07.460 --> 33:09.100]  a tokenizer function of
[33:09.100 --> 33:10.460]  that email. So we can
[33:10.460 --> 33:11.700]  see how it's transformed.
[33:14.180 --> 33:15.840]  So just so I don't go
[33:15.840 --> 33:17.240]  too fast, I want to make
[33:17.240 --> 33:18.200]  sure that you're able to
[33:18.200 --> 33:19.660]  follow along and copy
[33:19.660 --> 33:21.940]  this down. It's fairly
[33:21.940 --> 33:24.320]  simple. So I will wait
[33:24.320 --> 33:27.620]  just a few seconds. How
[33:27.620 --> 33:28.420]  are we doing on
[33:28.420 --> 33:29.460]  questions?
[33:31.180 --> 33:32.520]  No winner has been
[33:32.520 --> 33:34.160]  posting some helpful
[33:34.160 --> 33:38.640]  links, but hasn't had
[33:38.640 --> 33:39.640]  too many questions so
[33:39.640 --> 33:42.140]  far. So thank you very
[33:42.140 --> 33:43.420]  much. No winner.
[33:44.540 --> 33:46.260]  Okay. So when we run
[33:46.260 --> 33:47.760]  that, what we have is
[33:47.760 --> 33:49.100]  the email body. We see
[33:49.100 --> 33:50.280]  product review, vert
[33:50.280 --> 33:52.320]  tools, dev, 2.0 URL, and
[33:52.320 --> 33:55.020]  then this URL. We see
[33:55.020 --> 33:56.560]  product review, vert
[33:56.560 --> 33:59.280]  tool, dev, 2.0 URL as
[33:59.280 --> 34:02.060]  one word, HTTP, and then
[34:02.060 --> 34:04.600]  the full URL. So this
[34:04.600 --> 34:05.460]  is how we can identify
[34:05.460 --> 34:07.680]  and kind of track our
[34:07.680 --> 34:08.520]  data as it's being
[34:08.520 --> 34:09.620]  transformed over time.
[34:11.980 --> 34:13.200]  So task 3 is where
[34:13.200 --> 34:13.620]  we're going to start
[34:13.620 --> 34:14.700]  using our vectorizers.
[34:14.700 --> 34:15.940]  What we want to do is
[34:15.940 --> 34:17.560]  get a count of each
[34:17.560 --> 34:18.720]  unique word, and we want
[34:18.720 --> 34:20.620]  to get a count of each
[34:21.460 --> 34:23.540]  unique word per class so
[34:23.540 --> 34:24.760]  that we can use that in
[34:24.760 --> 34:26.140]  our classification. So
[34:26.140 --> 34:26.920]  there's two ways we can
[34:26.920 --> 34:27.760]  do that. We can use the
[34:27.760 --> 34:28.940]  straight count vectorizer,
[34:28.940 --> 34:29.900]  which will get the count
[34:29.900 --> 34:32.560]  of our words, and we can
[34:32.560 --> 34:34.020]  use a TF-IDF vectorizer,
[34:34.020 --> 34:35.400]  which will give us a
[34:35.400 --> 34:36.920]  weight of each count.
[34:39.480 --> 34:41.070]  So no winner says he's
[34:41.820 --> 34:44.320]  behind. He's catching up
[34:44.320 --> 34:45.300]  and he'll catch up
[34:45.300 --> 34:46.760]  later and follow along.
[34:49.500 --> 34:51.440]  I can try to figure out
[34:51.760 --> 34:53.800]  a system of helping the
[34:53.800 --> 34:55.460]  winner. I can send you a
[34:55.460 --> 34:57.500]  Zoom link or something
[34:57.500 --> 34:59.140]  and look over your
[34:59.140 --> 35:01.680]  shoulder metaphorically.
[35:01.680 --> 35:03.060]  Also, in the workbooks,
[35:03.060 --> 35:04.980]  you can just open up
[35:04.980 --> 35:06.000]  the... that's the wrong
[35:06.000 --> 35:06.660]  one... you can just open
[35:06.660 --> 35:08.600]  up the completed spam
[35:08.600 --> 35:10.640]  filter sklearn, and
[35:10.640 --> 35:11.680]  it's completely filled out
[35:11.680 --> 35:13.080]  for you. So if you ever
[35:13.080 --> 35:13.940]  do fall behind and you
[35:13.940 --> 35:14.740]  would like to catch up,
[35:14.740 --> 35:15.760]  you can look at the
[35:15.760 --> 35:16.960]  answers there.
[35:18.820 --> 35:20.400]  There is a way to
[35:20.400 --> 35:21.840]  follow along and make
[35:21.840 --> 35:23.000]  sure your code works.
[35:23.880 --> 35:25.400]  Yeah, absolutely. And I
[35:25.400 --> 35:26.220]  like using the completed
[35:26.220 --> 35:27.140]  workbook. You can just
[35:27.140 --> 35:28.260]  run everything and see
[35:28.260 --> 35:29.320]  how things are supposed
[35:29.320 --> 35:31.200]  to be at each step of
[35:31.200 --> 35:34.140]  the process. So we're
[35:34.140 --> 35:34.600]  going to go ahead and
[35:34.600 --> 35:35.540]  train our vectorizers.
[35:35.540 --> 35:36.380]  And the instructions tell
[35:36.380 --> 35:37.240]  us exactly how to do
[35:37.240 --> 35:38.340]  this. So we're going to
[35:38.340 --> 35:39.520]  use the vectorizer.
[35:39.520 --> 35:40.060]  We're going to call it
[35:40.060 --> 35:41.040]  cvec for the count
[35:41.040 --> 35:42.680]  vectorizer, tvec for the
[35:42.680 --> 35:44.040]  tfidf vectorizer, and
[35:44.040 --> 35:44.640]  we're going to perform
[35:44.640 --> 35:47.440]  fit transform. Fit in
[35:47.440 --> 35:48.980]  scikit-learn world is
[35:49.420 --> 35:51.200]  code for train. So this
[35:51.200 --> 35:51.820]  is going to, quote
[35:51.820 --> 35:52.800]  unquote, train our
[35:52.800 --> 35:53.900]  vectorizers so they can
[35:53.900 --> 35:54.740]  get the count of each
[35:54.740 --> 35:55.460]  word. And then we're
[35:55.460 --> 35:55.880]  going to save the
[35:55.880 --> 35:57.840]  results as count x and
[35:58.260 --> 36:00.940]  tfidf x. So what that
[36:00.940 --> 36:02.540]  looks like is cvec
[36:03.740 --> 36:04.540]  equals count
[36:05.260 --> 36:06.380]  vectorizer. And then we
[36:06.380 --> 36:07.400]  want to use our
[36:07.400 --> 36:08.300]  tokenizer that we
[36:08.300 --> 36:10.580]  defined above. So the
[36:10.580 --> 36:11.860]  tokenizer that we're
[36:11.860 --> 36:12.920]  going to use is called
[36:13.720 --> 36:15.100]  tokenizer. I know, not
[36:15.100 --> 36:16.960]  confusing at all. But
[36:18.360 --> 36:19.440]  we're going to use our
[36:21.920 --> 36:23.340]  count x. And pay
[36:23.340 --> 36:23.880]  attention to your
[36:23.880 --> 36:25.380]  capitals. Scikit-learn
[36:25.380 --> 36:27.060]  is very picky and so is
[36:27.060 --> 36:28.880]  my workbook. We're going
[36:28.880 --> 36:30.710]  to use cvec.fit
[36:32.380 --> 36:33.660]  transform. And then
[36:33.660 --> 36:34.520]  we're going to train
[36:34.520 --> 36:36.340]  transform or count the
[36:36.340 --> 36:39.400]  words for our corpus.
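Written out, the vectorizer step might look like the sketch below. The `tokenizer` here is a simplified stand-in for the workshop's function, the three-email corpus is invented, and the names `count_X` and `tfidf_X` are a guess at the capitalization the speaker refers to. The TF-IDF vectorizer takes the same arguments as the count vectorizer, so both are shown.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def tokenizer(text):
    """Simplified stand-in for the workshop's tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

# Hypothetical three-email corpus.
corpus = [
    "Win a free prize",
    "Thanks for your support",
    "Free free offer",
]

# token_pattern=None avoids a warning, because a custom tokenizer replaces the default pattern.
cvec = CountVectorizer(tokenizer=tokenizer, token_pattern=None)
count_X = cvec.fit_transform(corpus)

tvec = TfidfVectorizer(tokenizer=tokenizer, token_pattern=None)
tfidf_X = tvec.fit_transform(corpus)

# Both results are sparse matrices: one row per email, one column per unique token.
print(count_X.shape, tfidf_X.shape)
print(count_X[2, cvec.vocabulary_["free"]])
```

`fit_transform` learns the vocabulary and produces the matrix in one call; `vocabulary_` maps each token to its column index, which is how a single count can be looked up.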
[36:40.160 --> 36:40.600]  And then we're going
[36:40.600 --> 36:41.280]  to do the same thing
[36:41.280 --> 36:42.220]  with our tfidf
[36:42.220 --> 36:44.540]  vectorizer. tvec
[36:44.540 --> 36:46.600]  equals tfidf
[36:46.600 --> 36:48.360]  vectorizer. And we're
[36:48.360 --> 36:49.180]  going to use our
[36:49.180 --> 36:50.720]  tokenizer called
[36:50.720 --> 36:52.820]  tokenizer. And then
[36:52.820 --> 36:54.680]  our tfidf x
[36:54.680 --> 36:59.220]  equals tvec.fit
[36:59.220 --> 37:02.020]  transform. And it's
[37:02.020 --> 37:02.860]  also going to be on
[37:02.860 --> 37:04.500]  our corpus. So I'm
[37:04.500 --> 37:05.160]  going to let this sit
[37:05.160 --> 37:07.240]  for a minute and let
[37:07.240 --> 37:08.120]  you guys be able to
[37:08.120 --> 37:13.060]  catch up. And this
[37:13.060 --> 37:13.820]  takes a little bit.
[37:13.820 --> 37:15.800]  This takes about two
[37:15.800 --> 37:17.840]  minutes to run because
[37:17.840 --> 37:18.620]  what it's doing is it's
[37:18.620 --> 37:19.640]  reading every single
[37:19.640 --> 37:21.000]  email and it is
[37:21.000 --> 37:22.260]  counting the words in
[37:22.260 --> 37:25.100]  those emails in order
[37:25.100 --> 37:26.060]  to get the number of
[37:26.060 --> 37:30.760]  unique words. So I'm
[37:30.760 --> 37:31.680]  going to go ahead and
[37:31.680 --> 37:33.320]  run this because it's
[37:33.320 --> 37:34.100]  going to take a little
[37:34.100 --> 37:35.800]  bit. But in case you
[37:35.800 --> 37:37.240]  need, here is the
[37:37.240 --> 37:38.440]  code while this is
[37:38.440 --> 37:39.240]  operating. And so we
[37:39.240 --> 37:40.560]  can see that it is
[37:40.560 --> 37:41.560]  currently training the
[37:41.560 --> 37:42.640]  count vectorizer, kind
[37:42.640 --> 37:43.320]  of down here at the
[37:43.320 --> 37:45.000]  bottom. And then once
[37:45.000 --> 37:45.780]  the count vectorizer is
[37:45.780 --> 37:46.540]  complete, it will start
[37:46.540 --> 37:47.600]  training the tfidf
[37:47.600 --> 37:49.280]  vectorizer. And then it
[37:49.280 --> 37:50.420]  will let us know when
[37:50.420 --> 38:02.480]  that is complete. So
[38:02.480 --> 38:03.440]  while this is going,
[38:03.440 --> 38:04.220]  I'm going to go ahead
[38:04.220 --> 38:05.860]  and move on to the
[38:05.860 --> 38:06.980]  next code block. We
[38:06.980 --> 38:07.800]  can write the code
[38:07.800 --> 38:08.500]  while this is still
[38:08.500 --> 38:09.400]  training, but we won't
[38:09.400 --> 38:10.100]  be able to run it
[38:10.100 --> 38:10.860]  because it can run
[38:10.860 --> 38:12.080]  only one code block at
[38:12.180 --> 38:13.800]  a time. So in this
[38:13.800 --> 38:14.720]  case, we have task
[38:14.720 --> 38:16.240]  3a, which is to count
[38:16.240 --> 38:18.060]  the test email tokens.
[38:18.060 --> 38:18.900]  So we're going to
[38:18.900 --> 38:20.120]  manually count out
[38:20.880 --> 38:22.420]  each of the tokens.
[38:22.420 --> 38:23.280]  So what we're going to
[38:23.280 --> 38:27.480]  do is say for i in
[38:27.480 --> 38:28.800]  and this is a bit of
[38:28.960 --> 38:29.820]  a trick so that we can
[38:29.820 --> 38:31.800]  keep the count as well
[38:31.800 --> 38:33.040]  as the words in order.
[38:33.040 --> 38:33.560]  So we're going to use
[38:33.740 --> 38:35.620]  a list of a dictionary
[38:38.820 --> 38:40.420]  and we're going to
[38:40.420 --> 38:43.240]  want to grab the
[38:43.240 --> 38:44.260]  tokenized version of
[38:44.260 --> 38:45.080]  our email. So let's
[38:45.080 --> 38:48.720]  call this tokenized
[38:49.360 --> 38:51.280]  email equals tokenizer
[38:52.860 --> 38:55.500]  test email. And then
[38:55.500 --> 38:56.620]  our fromkeys we're
[38:56.620 --> 38:57.300]  going to have tokenized
[38:57.300 --> 38:58.380]  email which we saw
[38:58.380 --> 38:59.660]  above is going to be a
[38:59.660 --> 39:00.420]  list. So we'll just
[39:00.420 --> 39:03.260]  call that tokenized
[39:03.260 --> 39:05.040]  email and then we're
[39:05.040 --> 39:06.580]  going to count the
[39:06.580 --> 39:09.620]  words. So print, do
[39:09.620 --> 39:10.400]  some Python format,
[39:10.400 --> 39:23.000]  and then we're going to look at tokenized_email.count() of each word, and then the word itself.
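The manual count described above can be sketched roughly as follows. This is only an illustration: the `tokenizer` function and the short `test_email` string here are hypothetical stand-ins, since the workshop's real tokenizer also stems words and its test email is longer.

```python
# Hypothetical stand-ins for the workshop's tokenizer and test email.
def tokenizer(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

test_email = "unsubscribe thanks lenny unsubscribe support"

tokenized_email = tokenizer(test_email)

# dict.fromkeys keeps one key per unique token, in first-seen order
# (dictionaries preserve insertion order in Python 3.7+).
unique_tokens = list(dict.fromkeys(tokenized_email))

# Pair each unique token with how often it occurs in the full token list.
counts = [(tokenized_email.count(word), word) for word in unique_tokens]
for count, word in counts:
    print("{} {}".format(count, word))
```

Wrapping `dict.fromkeys` in `list` is what removes duplicates while keeping the words in the order they appear in the email; a plain `set` would not keep that order.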
[39:23.440 --> 39:29.660]  So we see our vectorizing is complete, so we are ready to run this.
[39:29.660 --> 39:36.060]  I'm going to let this sit for about 30 more seconds and let you guys catch up.
[39:36.060 --> 39:50.970]  And then, of course, after this workshop, I have a whole series of workbooks, as well as their solutions.
[39:50.970 --> 39:55.830]  So if you're interested in other ways to apply machine learning and security to different problems,
[39:55.830 --> 40:00.050]  and using different algorithms, don't forget to take a look at those.
[40:02.630 --> 40:08.350]  So I'm going to go ahead and run this, Shift-Enter, and we can see the counts.
[40:08.350 --> 40:11.490]  So we have East Asian font, Lenny, thanks, support.
[40:11.830 --> 40:17.710]  Each of these, we have our unique words, they only show up once in our email.
[40:17.710 --> 40:23.830]  The only two outliers are unsubscribe, or unsubscrib after our stemmer,
[40:23.830 --> 40:28.890]  and list.debian.org, which both show up twice in our email.
[40:28.890 --> 40:33.030]  So this is if we manually counted these, but what if we used our vectorizers?
[40:33.770 --> 40:38.470]  So what we're going to do here in Task 3B is create new vectorizers,
[40:38.470 --> 40:40.770]  the count vectorizer and TF-IDF vectorizer,
[40:40.770 --> 40:47.430]  and then we're going to print out the counts that each of them represent.
[40:47.670 --> 40:51.950]  So in the first one here, I'll just call this example_cvec,
[40:51.950 --> 40:57.150]  and then we're going to define a new vectorizer.
[41:02.220 --> 41:13.760]  And I'll just call this example_X equals example_cvec.fit_transform.
[41:14.340 --> 41:18.100]  And then it expects a list, our email is a string,
[41:18.100 --> 41:20.700]  so we're going to wrap it in these square brackets.
[41:21.000 --> 41:23.600]  And we're just going to use our raw test email.
[41:26.080 --> 41:30.960]  And then we're going to want to print our example_X.
[41:33.140 --> 41:38.180]  And we can easily just copy and paste, do the same thing for our TF-IDF code.
[41:38.180 --> 41:43.700]  We'll just call this example_tvec, we'll use the TF-IDF vectorizer,
[41:43.700 --> 41:48.120]  tokenizer is tokenizer, we change this to tvec,
[41:48.120 --> 41:50.160]  and then we're already done with our example_X,
[41:50.160 --> 41:52.840]  so we don't need to change the name, we can just leave it as is.
[41:53.980 --> 41:58.440]  example_tvec.fit_transform, and then square brackets, test_email,
[41:58.440 --> 41:59.940]  so that it's a list.
[42:00.860 --> 42:01.760]  Yes.
[42:04.900 --> 42:07.800]  Oh, it sounded like there was possibly a question.
[42:13.190 --> 42:18.510]  So Voto on the stream is asking, where are those workbooks again?
[42:24.050 --> 42:28.070]  Can one of the TAs post in the stream?
[42:28.070 --> 42:29.750]  There you go, perfect.
[42:30.270 --> 42:32.170]  Machine Learning for Security Analysts.
[42:33.530 --> 42:37.830]  Okay, so I'm going to go ahead and run this.
[42:38.570 --> 42:41.030]  Scroll up, let's take a look at the count vectorizer.
[42:41.130 --> 42:43.690]  So instead of seeing the words, we see these numbers,
[42:43.690 --> 42:44.890]  and that's exactly what we want.
[42:44.890 --> 42:47.090]  We've created what are called tokens.
[42:47.330 --> 42:50.970]  So the words themselves, the vectorizers
[42:50.970 --> 42:53.750]  will be able to match them to these numbers.
[42:55.090 --> 42:59.870]  So all of them are one in their count, all of them are unique,
[42:59.870 --> 43:03.150]  except for token 16 and token 9.
[43:03.150 --> 43:07.810]  We don't know which one, either unsubscribe or list.debian.org,
[43:07.810 --> 43:12.910]  maps to 16 or 9, but what we do know is that both of them show up twice.
[43:12.910 --> 43:17.410]  So given that, list.debian.org is either 16 or 9,
[43:17.410 --> 43:21.990]  and then the other one belongs to the word unsubscribe.
[43:21.990 --> 43:27.010]  So this looks exactly the same as if we had done our raw count.
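To see how a token can end up as a number like 9 or 16, here is a minimal plain-Python sketch of a vocabulary mapping. It is a toy stand-in, not scikit-learn's implementation, and the token list is a shortened, made-up version of the test email. In scikit-learn itself, the fitted vectorizer's `vocabulary_` attribute holds the word-to-index mapping, so the ambiguity mentioned above can be resolved by looking a word up there.

```python
from collections import Counter

# Shortened, illustrative token list (already tokenized and stemmed).
tokens = ["unsubscrib", "thanks", "list.debian.org", "lenny",
          "unsubscrib", "list.debian.org", "support"]

# Give every unique token a column index (alphabetical order here).
vocabulary = {token: index for index, token in enumerate(sorted(set(tokens)))}

# Count tokens, then report them as (column index, count) pairs,
# which is the shape of what the vectorizer printout shows.
token_counts = Counter(tokens)
row = sorted((vocabulary[token], count) for token, count in token_counts.items())

# Invert the mapping to recover which word a column index stands for.
index_to_token = {index: token for token, index in vocabulary.items()}
for index, count in row:
    print(index, count, index_to_token[index])
```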
[43:27.650 --> 43:30.810]  If we look at our TF-IDF vectorizer, though,
[43:30.810 --> 43:32.470]  we see something a little different.
[43:32.470 --> 43:36.030]  TF-IDF, what it does is use term frequency and inverse document frequency
[43:36.030 --> 43:39.950]  to give each term or token a weight.
[43:40.030 --> 43:44.310]  And so in this case, we see all of the ones that showed up once
[43:44.310 --> 43:49.950]  have a weight of 0.204, token 9 and token 16
[43:50.450 --> 43:53.790]  have a weight of 0.408.
[43:54.270 --> 43:56.690]  So you don't really need to understand the underlying math,
[43:56.690 --> 44:00.930]  you just need to know that there are different ways to vectorize your data.
[44:01.590 --> 44:05.550]  So there's TF-IDF, count vectorizer, word2vec,
[44:05.550 --> 44:08.130]  and a couple others, bag of words.
[44:08.870 --> 44:12.470]  And so there's different ones that we can try and see which one works better for us.
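The weights 0.204 and 0.408 quoted above can be reproduced by hand. The sketch below assumes 16 tokens that appear once and 2 tokens that appear twice (an inference from the quoted weights, not something stated in the workshop), and it assumes the row is scaled to unit L2 length, which is the default normalization in scikit-learn's TfidfVectorizer. With only one document, every term gets the same inverse document frequency factor, so only the counts and the normalization matter.

```python
import math

# Assumption: 16 tokens appear once and 2 tokens appear twice.
counts = [1] * 16 + [2] * 2

# With a single document every term shares the same IDF factor, so the
# TF-IDF row is just the count row rescaled to unit (L2) length.
l2_norm = math.sqrt(sum(c * c for c in counts))
weights = [c / l2_norm for c in counts]

print(round(weights[0], 3))   # weight of a token seen once
print(round(weights[-1], 3))  # weight of a token seen twice
```

The token seen twice gets exactly double the weight of a token seen once, which matches the 0.204 versus 0.408 pattern in the output.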
[44:14.170 --> 44:19.090]  So let's hop down and start to load our testing data.
[44:19.090 --> 44:22.390]  So loading our testing data, what this is going to do is
[44:22.990 --> 44:30.450]  just show us the names of each of our files in our testing directory.
[44:31.430 --> 44:34.250]  And we have train email.txt,
[44:34.250 --> 44:36.650]  and then at the end we have the class that it belongs to.
[44:36.650 --> 44:39.030]  So we have ham, we have spam.
[44:39.750 --> 44:42.770]  So what we're going to want to do as part of loading our testing data
[44:42.770 --> 44:45.310]  is we're going to want to strip off these labels.
[44:45.910 --> 44:48.530]  So we're going to read the full email into our corpus,
[44:48.530 --> 44:49.850]  and we're going to strip off these labels,
[44:49.850 --> 44:55.010]  and those are going to be the new labels for our testing data.
[44:56.410 --> 45:00.110]  So that hops us into task 4 to load the testing data.
[45:00.110 --> 45:03.530]  Just like loading the training data, we're going to create two list arrays,
[45:03.530 --> 45:06.810]  our corpus, we're going to call this test corpus, and our test labels.
[45:06.970 --> 45:09.790]  And then we're going to load each email body,
[45:09.790 --> 45:14.310]  and then strip off the label and throw that into the test labels array.
[45:16.450 --> 45:20.810]  So what that's going to look like is...
[45:25.550 --> 45:28.530]  So we're going to look at... what was the name of that?
[45:28.730 --> 45:30.770]  Test corpus.
[45:30.950 --> 45:32.930]  And we're just going to set that as a blank list,
[45:32.930 --> 45:37.270]  and then test labels as a blank list.
[45:37.990 --> 45:41.010]  For loading each of the emails, we're going to do just like we did before,
[45:41.010 --> 45:44.510]  for filename in os.listdir,
[45:44.510 --> 45:50.130]  and then our data directory plus the test directory.
[45:50.130 --> 45:51.430]  So we're going to look at each filename,
[45:51.430 --> 45:56.010]  we're going to open each file,
[45:56.010 --> 46:03.110]  so data.dir plus test plus filename,
[46:03.110 --> 46:08.230]  and then we want to read those as the file descriptor F.
[46:09.190 --> 46:14.090]  So our test corpus, we're going to append F.read,
[46:14.090 --> 46:17.290]  which is going to read that email into our test corpus array,
[46:17.290 --> 46:19.830]  and then we want to grab the label.
[46:19.830 --> 46:23.950]  And in this case, I'm just going to use a simple regular expression.
[46:25.390 --> 46:30.670]  We're going to split on txt, and then this is a backslash period,
[46:33.330 --> 46:36.710]  and filename, bracket 1.
[46:36.710 --> 46:40.050]  I'm going to scroll up real quick, because the reason why I want to do that is
[46:40.050 --> 46:45.350]  what this is saying is, where we see txt followed by a period, we're going to split,
[46:45.350 --> 46:49.690]  so we have this full train name up through txt and the period, and then we have ham.
[46:49.910 --> 46:54.650]  And then the bracket 1, what that is going to do is say,
[46:54.650 --> 46:57.890]  this is bracket 0 once it's split, and then this is bracket 1,
[46:57.890 --> 47:02.610]  so we're just going to keep our labels, which are at the end of that split.
[47:05.230 --> 47:09.990]  And then with that label, or each label that we grab from the filename,
[47:09.990 --> 47:17.350]  we're going to append it to our labels directory, or our labels array.
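The loading loop described above might look roughly like this. The file names, the email bodies, and the `load_corpus` helper are hypothetical; the only thing taken from the workshop is the idea of reading each file and splitting its name on `txt\.` to keep the label at index 1. The demo writes two throwaway files to a temporary directory so it can run anywhere, and `sorted` is added only to make the order predictable.

```python
import os
import re
import tempfile

def load_corpus(directory):
    """Read every file in a directory and pull its label from the filename."""
    corpus, labels = [], []
    for filename in sorted(os.listdir(directory)):
        with open(os.path.join(directory, filename), "r") as f:
            corpus.append(f.read())
        # re.split gives [text before "txt.", text after it];
        # index 1 is the class label at the end of the name.
        labels.append(re.split(r"txt\.", filename)[1])
    return corpus, labels

# Demo on a throwaway directory with two made-up emails.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, body in [("email_0001.txt.ham", "see you at lunch"),
                       ("email_0002.txt.spam", "win a free prize")]:
        with open(os.path.join(demo_dir, name), "w") as f:
            f.write(body)
    test_corpus, test_labels = load_corpus(demo_dir)

print(test_labels)
```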
[47:19.210 --> 47:21.450]  So this is what our code looks like.
[47:21.450 --> 47:25.990]  Double checking for errors, looks good to me.
[47:26.390 --> 47:28.710]  And I'm going to let this sit for a little bit,
[47:28.710 --> 47:31.630]  so you guys can go ahead and copy that down.
[47:39.290 --> 47:41.170]  How are we doing?
[47:45.460 --> 47:50.780]  No questions in Twitch stream, are there any questions in the Discord channel?
[47:57.980 --> 48:01.420]  Okay, I'm going to go ahead, shift, enter, and run.
[48:01.420 --> 48:04.040]  So this is going to quickly load the emails.
[48:06.240 --> 48:09.400]  And then let's run this next code block.
[48:09.400 --> 48:13.060]  So what this is going to do is allow us to visualize our data, just like we did before.
[48:13.060 --> 48:19.200]  So we see in our testing set, we have 590 ham emails, 276 spam emails,
[48:19.200 --> 48:23.040]  and the graph itself looks nearly identical to the one above,
[48:23.040 --> 48:29.160]  and that's because we did a random 80-20 split for our training and testing.
[48:30.300 --> 48:33.020]  And that was already done beforehand.
[48:34.460 --> 48:38.280]  So the next thing we want to do is use the vectorizers that we created,
[48:38.280 --> 48:42.640]  the cvec and the tvec vectorizer,
[48:42.640 --> 48:46.280]  and take a look at the test corpus,
[48:46.280 --> 48:51.780]  and get our counts of each word in the new emails.
[48:52.560 --> 48:58.560]  So let's go ahead, test_count_X equals, and we're going to use the cvec we already trained,
[48:58.560 --> 49:01.900]  we're not going to use fit_transform, instead we're going to use transform,
[49:01.900 --> 49:06.420]  because we're not trying to train it, we're just trying to count the words themselves.
[49:07.680 --> 49:20.040]  We'll do test_corpus, and then test_tfidf_X equals tvec.transform of test_corpus.
[49:20.280 --> 49:25.940]  And that's all we need to do when we're using a piece of testing data.
[49:27.700 --> 49:30.320]  So I'm going to let that sit for a second,
[49:30.320 --> 49:33.620]  actually I'm going to run it because this is going to take a little bit as well.
[49:36.620 --> 49:38.900]  Slightly shorter than when we were training it,
[49:38.900 --> 49:44.940]  but it still needs to go through all the emails and apply the counts,
[49:44.940 --> 49:47.460]  and then the TFIDF, it needs to apply the weights.
[49:47.460 --> 49:49.640]  We already know what the counts are, we already know what the weights are,
[49:49.640 --> 49:51.640]  we just need to apply them to each token.
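The difference between fit_transform on training data and transform on testing data can be illustrated with a toy vectorizer. `TinyCountVectorizer` below is a hypothetical plain-Python stand-in, not scikit-learn's class, and the sentences are made up. The point is that fitting learns the vocabulary, while transform only applies it, so a word never seen in training is simply dropped.

```python
class TinyCountVectorizer:
    """Toy stand-in for a count vectorizer; not scikit-learn's class."""

    def fit(self, documents):
        # Learn the vocabulary: one column index per unique word.
        words = sorted({word for doc in documents for word in doc.split()})
        self.vocabulary_ = {word: i for i, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Count only words already in the vocabulary; unseen words are dropped.
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for word in doc.split():
                if word in self.vocabulary_:
                    row[self.vocabulary_[word]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

cvec = TinyCountVectorizer()
count_X = cvec.fit_transform(["free prize now", "lunch at noon"])  # training
test_count_X = cvec.transform(["free lunch tomorrow"])             # testing

print(cvec.vocabulary_)
print(test_count_X)
```

Because the test row has the same columns as the training rows, a model trained on `count_X` can score it directly; "tomorrow" contributes nothing because it has no column.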
[49:52.220 --> 49:53.840]  So vectorizing is complete.
[49:54.520 --> 49:59.220]  And now we're finally on to testing and evaluating our model.
[49:59.640 --> 50:02.280]  So you remember how I said data collection and preprocessing
[50:02.280 --> 50:05.460]  is what's going to take the most amount of time?
[50:05.460 --> 50:07.640]  Well, that was about 80% of this workbook.
[50:08.140 --> 50:11.340]  If not, maybe a little more, maybe 85%.
[50:12.320 --> 50:14.740]  But yeah, data science in a nutshell.
[50:15.680 --> 50:20.620]  So here we have a very helpful report generator function.
[50:20.920 --> 50:24.900]  It's not important to know what's going on in this,
[50:24.900 --> 50:26.280]  it's just going to be a helper function,
[50:26.280 --> 50:29.280]  but we do need to know that it's going to take in our confusion matrix,
[50:29.280 --> 50:32.560]  our score, which is our accuracy, and the classification report,
[50:32.560 --> 50:35.120]  and then it's going to print out all the stats that we need.
[50:36.520 --> 50:39.860]  So that hops us into task 6A,
[50:39.860 --> 50:43.400]  which is going to be training and evaluating our first model,
[50:43.400 --> 50:48.340]  which is going to be multinomial Naive Bayesian using the TFIDF vectorizer.
[50:49.980 --> 50:53.940]  So there's a lot of steps to this, but we're going to walk through it step by step.
[50:54.200 --> 50:58.160]  First, what we're going to do is create a multinomial Naive Bayesian constructor,
[50:58.160 --> 51:01.420]  so that's going to look like mnb_tfidf,
[51:01.420 --> 51:04.160]  and the names of these variables are very important
[51:04.160 --> 51:08.000]  because that's what's being used in the generator function.
[51:08.980 --> 51:14.320]  So I'm going to call this multinomial Naive Bayesian,
[51:14.840 --> 51:18.660]  and then we're going to train the multinomial Naive Bayesian,
[51:18.660 --> 51:24.200]  as I already misspelled something, mnb_tfidf.fit,
[51:24.200 --> 51:27.040]  which is scikit-learn's code for train,
[51:27.040 --> 51:31.300]  and then we're going to use our TFIDF, our vectorized corpus,
[51:31.300 --> 51:33.440]  with the TFIDF array.
[51:35.360 --> 51:38.060]  So now we're going to get a couple stats,
[51:38.060 --> 51:40.060]  so we're going to get our score,
[51:40.060 --> 51:48.740]  score_mnb_tfidf equals mnb_tfidf.score.
[51:48.740 --> 51:50.400]  This is going to give us our accuracy,
[51:50.400 --> 51:54.500]  and we want to compare this against our test_tfidf_X,
[51:54.500 --> 51:57.380]  and our test labels.
[51:58.260 --> 52:01.140]  And then we want to take a look at the raw predictions,
[52:01.140 --> 52:04.680]  so we're going to have predictions_mnb_tfidf
[52:05.400 --> 52:11.600]  equals mnb_tfidf.predict, using the predict function,
[52:11.600 --> 52:14.760]  on our test_tfidf_X.
[52:15.540 --> 52:16.400]  Cool.
[52:16.400 --> 52:19.760]  And with those results, what we are going to do
[52:19.760 --> 52:23.700]  is generate a classification, a confusion matrix,
[52:23.700 --> 52:25.900]  so we can see our true positives, false positives,
[52:25.900 --> 52:27.540]  true negatives, false negatives.
[52:27.580 --> 52:39.780]  So cmatrix_mnb_tfidf equals our confusion matrix,
[52:39.780 --> 52:43.720]  and then we're going to use our test labels
[52:43.720 --> 52:50.800]  and our predictions_mnb_tfidf, which is our raw predictions.
[52:50.800 --> 52:53.220]  We're going to do the same thing for our classification report,
[52:53.220 --> 52:58.680]  so we'll have creport_mnb_tfidf equals
[52:59.940 --> 53:02.300]  classification report, and in the same order,
[53:02.300 --> 53:09.320]  test labels and our predictions_mnb_tfidf.
[53:10.680 --> 53:13.900]  So I'm going to let this sit for a second,
[53:14.580 --> 53:17.520]  give you a chance to copy all of that down.
[53:19.580 --> 53:24.600]  So what's really helpful here is seeing how
[53:27.200 --> 53:30.720]  at each of these tasks, 6a, 6b, 6c, and 6d,
[53:31.320 --> 53:34.240]  we're only going to need to change very few things.
[53:34.240 --> 53:37.500]  We're going to need to change, instead of tfidf,
[53:37.500 --> 53:40.600]  we're going to change this to count to do our train and evaluate
[53:40.600 --> 53:43.500]  the multinomial Naive Bayesian with the count model.
[53:43.720 --> 53:46.620]  And then we're going to change this from
[53:46.620 --> 53:48.520]  the multinomial Naive Bayesian constructor
[53:48.520 --> 53:51.280]  to the logistic regression constructor.
[53:51.400 --> 53:53.840]  But other than that, the rest of our code is going to be identical.
[53:53.840 --> 53:56.280]  The rest of our preprocessing is going to be identical.
[53:56.480 --> 53:59.240]  And so this will allow us to train a total of four models
[53:59.960 --> 54:03.860]  using two separate machine learning algorithms
[54:03.860 --> 54:05.620]  as well as two separate vectorizers.
[54:06.340 --> 54:09.100]  So this is going to be very helpful for us.
[54:10.400 --> 54:14.040]  So I'm going to go ahead and run this.
[54:14.040 --> 54:16.260]  So shift-enter to run.
[54:16.260 --> 54:20.240]  It looks like I misspelled something.
[54:23.120 --> 54:24.560]  Multinomial.
[54:24.840 --> 54:26.660]  There we go.
[54:27.280 --> 54:29.440]  Required argument y.
[54:29.900 --> 54:31.400]  That one's new.
[54:32.980 --> 54:36.580]  Give me one second.
[54:36.900 --> 54:39.320]  Oh, that is what I forgot.
[54:39.360 --> 54:42.500]  So we want to train our multinomial Naive Bayesian
[54:42.500 --> 54:46.340]  with our data as well as our labels
[54:46.340 --> 54:50.120]  so that it knows what belongs to spam,
[54:50.120 --> 54:53.980]  what belongs to ham, just like in our pen-and-paper example,
[54:53.980 --> 54:56.560]  what belonged to sports and what belonged to not sports.
[54:57.020 --> 54:58.940]  So now I can run that.
[55:00.460 --> 55:04.360]  And I see where we ran into that problem yesterday.
[55:04.580 --> 55:05.260]  Okay.
[55:06.600 --> 55:08.740]  I need to swap.
[55:08.740 --> 55:11.800]  This is the problem that I had with scikit-learn earlier.
[55:13.140 --> 55:15.560]  The test labels and the predictions.
[55:15.560 --> 55:18.720]  We start with the predictions and then we follow up with the test labels
[55:18.720 --> 55:20.500]  instead of the other way around.
[55:21.240 --> 55:23.100]  So I go ahead and run that.
[55:23.800 --> 55:24.840]  There we go.
[55:24.840 --> 55:29.200]  So we know that we had 590 ham emails.
[55:29.200 --> 55:32.660]  We know that we had, if we add these two numbers together,
[55:33.440 --> 55:34.920]  276 spam emails.
[55:34.920 --> 55:38.200]  So we can see the actual number of ham emails, the actual number of spam emails
[55:38.200 --> 55:39.960]  in those columns.
[55:40.800 --> 55:42.640]  But let me scroll up real quick.
[55:42.760 --> 55:45.660]  So this is our classification report.
[55:45.700 --> 55:49.240]  Classification report we're not really going to use in this example,
[55:49.240 --> 55:51.500]  but it's good to know.
[55:51.500 --> 55:54.980]  There are other ways to measure how well a model is performing,
[55:54.980 --> 55:56.800]  beyond accuracy.
[55:57.500 --> 56:00.400]  Precision, recall, and F1 score
[56:00.400 --> 56:03.780]  are all used in their own ways.
[56:03.820 --> 56:07.700]  I'm not going to go into that, but it's definitely worth understanding.
[56:08.040 --> 56:10.780]  Instead, we're just going to look at the straight accuracy.
[56:10.780 --> 56:13.400]  For this multinomial with TF-IDF,
[56:13.400 --> 56:15.940]  we have a 0.875,
[56:15.940 --> 56:19.860]  or 87.5%, accuracy.
[56:19.860 --> 56:23.000]  If we look at our true positives and our true negatives,
[56:23.000 --> 56:26.500]  it correctly classified or predicted all ham emails
[56:26.500 --> 56:28.920]  as being part of the ham class.
[56:28.960 --> 56:32.460]  So these are all of our true positives for what is ham.
[56:32.540 --> 56:34.720]  And then our true negatives, what is spam,
[56:34.720 --> 56:40.280]  it only correctly classified 168 out of the 276 that we have.
[56:40.720 --> 56:43.080]  The other 108 spam emails
[56:43.080 --> 56:46.660]  it also thought were ham or regular emails.
[56:46.660 --> 56:49.720]  So our predictor is letting some spam through,
[56:49.720 --> 56:52.880]  but the good news is it's not blocking any of our
[56:52.880 --> 56:54.920]  legitimate emails.
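The numbers quoted above are enough to rebuild the confusion matrix and check the accuracy by hand. The `confusion_matrix` function below is a small plain-Python stand-in written for this sketch; it follows the same convention as scikit-learn's function of taking the true labels first and the predictions second, with rows as actual classes. Recall for the spam class is included as one example of a metric beyond accuracy.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Rebuild the results quoted above: 590 ham all predicted ham,
# 276 spam of which 168 were predicted spam and 108 predicted ham.
actual = ["ham"] * 590 + ["spam"] * 276
predicted = ["ham"] * 590 + ["spam"] * 168 + ["ham"] * 108

cmatrix = confusion_matrix(actual, predicted, ["ham", "spam"])
correct = cmatrix[0][0] + cmatrix[1][1]
accuracy = correct / len(actual)
spam_recall = cmatrix[1][1] / sum(cmatrix[1])

print(cmatrix)
print(round(accuracy, 3), round(spam_recall, 3))

# Passing the arguments in the other order transposes the matrix.
swapped = confusion_matrix(predicted, actual, ["ham", "spam"])
print(swapped)
```

So 758 of 866 emails are classified correctly, which is the 87.5% accuracy, while only about 61% of the spam is actually caught. Swapping the argument order does not change the accuracy, but it does flip which axis means "actual" and which means "predicted", so it is worth checking when reading a heatmap.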
[56:58.820 --> 57:02.660]  Moving on to the next example, we're going to do the same thing.
[57:02.660 --> 57:03.980]  We're going to use multinomial Naive Bayes
[57:04.540 --> 57:07.700]  with count vectorizer.
[57:07.700 --> 57:11.020]  This is going to be the same, mnb_count.
[57:13.020 --> 57:15.660]  I am actually a little lazy,
[57:15.660 --> 57:17.640]  so what I'm going to do
[57:17.640 --> 57:20.640]  is just copy this
[57:20.640 --> 57:22.460]  and I can show you all you need to do
[57:22.460 --> 57:27.020]  is just change a few things around and you have a working model.
[57:27.020 --> 57:29.060]  So instead of multinomial tfidf,
[57:29.060 --> 57:31.980]  I'm going to be using multinomial count.
[57:31.980 --> 57:33.800]  We're going to use the same constructor.
[57:34.780 --> 57:36.940]  So in this case it's called count,
[57:36.940 --> 57:41.580]  and instead of using the tfidf_X, we're going to use the count_X.
[57:44.320 --> 57:45.280]  score,
[57:45.280 --> 57:46.760]  count,
[57:49.240 --> 57:50.140]  mnb,
[57:50.140 --> 57:51.260]  count,
[57:51.260 --> 57:56.680]  and our test labels stay the same because we didn't vectorize those.
[57:56.680 --> 57:58.920]  There's no reason for us to change this.
[58:00.120 --> 58:04.160]  I'm just going to go ahead and ctrl-c and ctrl-v some stuff.
[58:05.900 --> 58:09.360]  So all I'm doing is just changing the word tfidf
[58:09.360 --> 58:11.100]  to the word count.
[58:11.100 --> 58:14.300]  Everything else is staying the same.
[58:15.040 --> 58:17.040]  We have our predictions,
[58:17.040 --> 58:18.300]  mnb, count,
[58:18.300 --> 58:22.840]  our test labels, that stays the same, we don't modify it.
[58:22.900 --> 58:25.140]  mnb, count,
[58:25.140 --> 58:29.760]  and then we change our predictions.
[58:30.600 --> 58:31.540]  So I'm going to go ahead
[58:31.540 --> 58:33.680]  and let that sit for a moment.
[58:33.680 --> 58:36.460]  I'll let you guys either manually type that in
[58:36.460 --> 58:40.940]  or do the copy-and-paste trick that I did,
[58:40.940 --> 58:42.700]  which I highly recommend.
[58:42.700 --> 58:44.980]  It just saves quite a bit of typing.
[58:46.280 --> 58:48.620]  And then we'll type out the logistic regression
[58:48.620 --> 58:51.860]  in a moment because that's the next model that we'll be making.
[58:54.780 --> 58:57.600]  How are we doing on questions?
[59:01.160 --> 59:03.520]  I think we're all doing well.
[59:04.300 --> 59:06.980]  Everything's quiet on the Discord chat, too.
[59:11.580 --> 59:12.380]  Awesome.
[59:12.380 --> 59:14.040]  So I'm going to go ahead and run this.
[59:14.040 --> 59:16.340]  I'm going to do Shift-Enter.
[59:17.640 --> 59:20.280]  And if we look at our accuracy, our multinomial
[59:20.280 --> 59:25.220]  with the count vectorizer, we have a 94% accuracy.
[59:25.380 --> 59:27.660]  So quite a decent bump
[59:27.660 --> 59:29.920]  compared to our tfidf vectorizer
[59:29.920 --> 59:32.700]  using the same multinomial algorithm.
[59:32.700 --> 59:36.760]  What we see is about 26 of our emails...
[59:36.760 --> 59:39.620]  not about, it's exactly 26 of our emails
[59:39.620 --> 59:41.900]  in our test set got blocked.
[59:41.900 --> 59:43.880]  They were considered spam.
[59:44.480 --> 59:46.940]  These are the regular ham emails.
[59:46.940 --> 59:49.480]  But we have significantly fewer spam emails
[59:49.480 --> 59:53.900]  that are getting through our predictor.
[59:53.900 --> 59:57.340]  And so more of them are being classified as spam correctly.
[59:57.340 --> 01:00:00.040]  And so this is what we want to see is a dark blue over in this corner.
[01:00:00.040 --> 01:00:03.720]  And a darker blue
[01:00:03.720 --> 01:00:07.080]  or this lighter blue given the count
[01:00:07.080 --> 01:00:09.080]  over in the bottom right corner.
[01:00:09.560 --> 01:00:12.200]  So this kind of hops into a conversation
[01:00:12.200 --> 01:00:15.840]  about false positives and false negatives.
[01:00:15.840 --> 01:00:19.000]  Which would you like to prioritize?
[01:00:19.980 --> 01:00:22.340]  And so it really depends on the model.
[01:00:22.340 --> 01:00:25.220]  In security, one of the things that we're concerned about
[01:00:25.220 --> 01:00:26.860]  is impacting the users.
[01:00:26.860 --> 01:00:30.980]  So if you look at something like an antivirus,
[01:00:30.980 --> 01:00:33.100]  it looks for things that are probably
[01:00:33.100 --> 01:00:36.280]  talking out to the internet, trying to take over your webcam,
[01:00:36.280 --> 01:00:38.660]  uploading and downloading files.
[01:00:38.720 --> 01:00:40.820]  Well, what else does that?
[01:00:41.400 --> 01:00:44.980]  Chrome, Internet Explorer, Firefox.
[01:00:45.760 --> 01:00:48.100]  So when you're building
[01:00:49.260 --> 01:00:50.720]  an antivirus,
[01:00:50.720 --> 01:00:53.940]  it's better to have
[01:00:53.940 --> 01:00:56.660]  what they consider false negatives.
[01:00:57.480 --> 01:01:00.540]  Here, positive means
[01:01:00.540 --> 01:01:03.180]  it is malware and negative means it is not malware.
[01:01:03.180 --> 01:01:05.920]  It's benign. So it's better to allow
[01:01:05.920 --> 01:01:09.000]  some viruses to run instead of
[01:01:09.820 --> 01:01:11.560]  blocking legitimate applications.
[01:01:11.560 --> 01:01:14.140]  Because you don't want to impact the user experience.
[01:01:14.140 --> 01:01:17.980]  Because if you impact the user experience, they're just going to uninstall your antivirus software.
[01:01:17.980 --> 01:01:21.120]  And that leaves them unprotected and that's bad business for you
[01:01:21.120 --> 01:01:23.320]  when people are uninstalling your software.
[01:01:24.960 --> 01:01:27.300]  So even for us, we want to consider
[01:01:27.300 --> 01:01:29.460]  and weigh the factors of
[01:01:29.460 --> 01:01:32.140]  is it better for us to allow some spam to get through
[01:01:32.140 --> 01:01:35.000]  as long as we're not blocking any of the legitimate emails?
[01:01:35.000 --> 01:01:38.180]  Or do we want to make sure that absolutely no spam gets through
[01:01:38.180 --> 01:01:42.480]  and if we block legitimate emails, well that's just collateral damage.
[01:01:42.560 --> 01:01:44.520]  So these are considerations that you need to make
[01:01:44.520 --> 01:01:47.600]  when you're building models like these and as a data scientist.
[01:01:49.180 --> 01:01:52.180]  So for the multinomial naïve Bayesian
[01:01:52.180 --> 01:01:55.360]  with the count vectorizer, we have 94% accuracy.
[01:01:55.740 --> 01:01:58.640]  So let's try a different model.
[01:01:58.640 --> 01:02:00.840]  This one we're going to use the logistic regression.
[01:02:01.120 --> 01:02:03.600]  So if we look at the instructions, we're going to use the
[01:02:03.600 --> 01:02:06.780]  logistic regression constructor, but we're going to use this
[01:02:07.580 --> 01:02:09.580]  LBFGS solver. And this is one of the
[01:02:09.580 --> 01:02:12.900]  hyperparameters, or at least I would consider it a hyperparameter
[01:02:13.660 --> 01:02:16.360]  for logistic regression itself.
[01:02:16.980 --> 01:02:19.420]  There's a couple different solvers that you can use
[01:02:19.420 --> 01:02:22.020]  but in this case we're going to use the LBFGS.
[01:02:22.360 --> 01:02:23.540]  So what this is going to look like is
[01:02:23.540 --> 01:02:26.700]  lgs_tfidf equals
[01:02:30.340 --> 01:02:31.140]  logistic regression
[01:02:31.720 --> 01:02:33.800]  and then we're going to set our solver
[01:02:34.340 --> 01:02:36.140]  equals LBFGS
[01:02:37.940 --> 01:02:40.200]  and then of course, just like we did with the multinomial
[01:02:40.200 --> 01:02:42.680]  naïve Bayesian, we're going to fit,
[01:02:42.680 --> 01:02:46.000]  or train, our classifier.
[01:02:46.000 --> 01:02:48.600]  We're going to use the tfidf_X,
[01:02:49.260 --> 01:02:51.860]  which is our training data,
[01:02:51.860 --> 01:02:54.580]  and then we're going to use
[01:02:54.580 --> 01:02:55.980]  our labels.
[01:02:57.620 --> 01:03:00.080]  Next we're going to hop in and do the score predictions
[01:03:00.530 --> 01:03:03.780]  classification matrix and, sorry, confusion matrix
[01:03:03.780 --> 01:03:05.640]  and classification reports. We're going to use
[01:03:06.360 --> 01:03:08.680]  score_lgs_tfidf equals
[01:03:08.680 --> 01:03:13.000]  lgs_tfidf.score.
[01:03:13.000 --> 01:03:15.000]  We're just going to look at our
[01:03:15.000 --> 01:03:17.800]  test_tfidf_X and our
[01:03:17.800 --> 01:03:20.920]  test labels. We're going to have our
[01:03:20.920 --> 01:03:22.060]  predictions:
[01:03:23.270 --> 01:03:25.940]  predictions_lgs_tfidf equals
[01:03:26.680 --> 01:03:30.120]  lgs_tfidf.predict.
[01:03:32.920 --> 01:03:34.520]  That's going to give us our
[01:03:34.520 --> 01:03:35.620]  raw predictions
[01:03:35.620 --> 01:03:38.740]  and then I'm going to
[01:03:38.740 --> 01:03:40.540]  stop saying exactly what I'm typing
[01:03:42.180 --> 01:03:44.360]  but we're going to do our confusion matrix
[01:03:59.660 --> 01:04:00.860]  and our
[01:04:01.700 --> 01:04:02.820]  classification report
[01:04:17.280 --> 01:04:19.380]  Awesome. So I'm going to let this sit
[01:04:19.380 --> 01:04:22.240]  here for a little bit, let you guys
[01:04:22.240 --> 01:04:24.560]  kind of copy and follow along
[01:04:27.120 --> 01:04:28.380]  Hopefully we don't run into
[01:04:28.380 --> 01:04:31.060]  the same problems that we did last time
[01:04:33.740 --> 01:04:35.140]  Binder is fantastic
[01:04:35.140 --> 01:04:37.260]  I love it, but it will time out after about
[01:04:37.260 --> 01:04:40.440]  five minutes and once it
[01:04:40.440 --> 01:04:42.400]  times out you've got to kind of restart the whole thing
[01:04:43.440 --> 01:04:45.300]  or at least you've got to restart the notebook
[01:04:45.300 --> 01:04:47.160]  which is kind of annoying
[01:04:48.400 --> 01:04:51.220]  Another thing you will notice is the logistic regression
[01:04:52.020 --> 01:04:54.500]  will take a little longer
[01:04:56.060 --> 01:04:59.280]  to train and give us our predictions
[01:04:59.280 --> 01:05:01.760]  and it's just the way that logistic regression operates
[01:05:01.760 --> 01:05:03.140]  it does a little bit more mathematical
[01:05:04.240 --> 01:05:06.220]  or complicated mathematical formulas
[01:05:07.000 --> 01:05:08.920]  than the multinomial Naive Bayesian
[01:05:10.080 --> 01:05:12.220]  so it just takes a little bit longer to train
[01:05:12.220 --> 01:05:15.140]  but not too much longer, so that's good
[01:05:17.360 --> 01:05:19.180]  One of them is matrix multiplication
[01:05:19.180 --> 01:05:21.320]  which is a whole array of floats
[01:05:21.320 --> 01:05:23.500]  that you have to process versus
[01:05:23.500 --> 01:05:27.080]  Naive Bayes is literally counting and some logs
[01:05:28.600 --> 01:05:30.340]  when you're doing the math correctly
[01:05:30.340 --> 01:05:31.640]  which is very quick
[01:05:31.640 --> 01:05:34.540]  It's just counting
[01:05:34.540 --> 01:05:36.960]  Yeah, it's just counting
[01:05:36.960 --> 01:05:42.820]  and logs, the latest x86 stuff
[01:05:42.820 --> 01:05:45.060]  have built-in processes
[01:05:45.060 --> 01:05:47.100]  for doing integer logs and stuff like that
[01:05:47.100 --> 01:05:48.400]  because it's so common
[01:05:48.400 --> 01:05:51.820]  to use logs for numerical accuracy
[01:05:51.820 --> 01:05:53.340]  in machine learning
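The "counting and some logs" description can be made concrete with a very small multinomial Naive Bayes written in plain Python. This is a sketch under stated assumptions, not scikit-learn's implementation: the training sentences are made up, words are split on whitespace, and add-one (Laplace) smoothing is used. Summing logarithms replaces multiplying many small probabilities, which is what keeps the arithmetic numerically stable.

```python
import math
from collections import Counter

def train(documents, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    vocabulary = {w for counter in word_counts.values() for w in counter}
    return class_counts, word_counts, vocabulary

def predict(doc, class_counts, word_counts, vocabulary):
    """Return the class with the highest log-probability score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in doc.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data for illustration only.
model = train(
    ["win a free prize", "free money now", "lunch at noon", "see you at lunch"],
    ["spam", "spam", "ham", "ham"],
)
print(predict("free prize money", *model))
print(predict("lunch at noon", *model))
```

Training is a single pass of counting, and prediction is a handful of additions per word, which is why this kind of model trains so quickly compared with one that needs repeated matrix multiplications.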
[01:05:54.720 --> 01:05:57.380]  You'll also notice the same thing with neural networks
[01:05:57.380 --> 01:05:59.640]  because neural networks do a lot of matrix multiplication
[01:06:00.840 --> 01:06:03.940]  which is why GPUs work really well for neural networks
[01:06:03.940 --> 01:06:08.100]  because you can parallelize matrix multiplications like that
[01:06:08.780 --> 01:06:10.220]  So I'm going to go ahead and run this
[01:06:10.220 --> 01:06:11.120]  Shift-Enter
[01:06:13.320 --> 01:06:17.220]  And it looks like it's running. Hopefully it doesn't take too long.
[01:06:19.260 --> 01:06:23.920]  Another comparison to make is a neural network versus a gradient-boosted tree.
[01:06:23.920 --> 01:06:28.580]  A gradient-boosted tree is all just if-then statements.
[01:06:29.460 --> 01:06:34.000]  Versus a neural network is a huge pain in the ass to multiply everything.
[01:06:34.000 --> 01:06:39.220]  So a GBT could be run in milliseconds
[01:06:39.220 --> 01:06:43.060]  versus a neural network taking seconds and 100% of your CPU.
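The contrast between "if-then statements" and "multiply everything" can be sketched in a few lines of plain Python. The features, thresholds, leaf values, and weights below are all invented for illustration; a real LightGBM model or neural network would learn them from data.

```python
import math

# Hypothetical features for one email: [num_links, num_exclamations, has_unsubscribe]
email = [7.0, 3.0, 1.0]

# A boosted ensemble is a sum of small trees; each tree is a few if-then checks.
def tree_1(x):
    if x[0] > 5:          # many links?
        return 0.8
    return -0.4

def tree_2(x):
    if x[2] > 0.5:        # contains "unsubscribe"?
        if x[1] > 2:      # and several exclamation marks?
            return 0.6
        return 0.2
    return -0.3

def gbt_predict(x):
    raw = tree_1(x) + tree_2(x)        # add up the leaf values
    return 1 / (1 + math.exp(-raw))    # squash the sum to a probability

# A dense neural-network layer multiplies every input by every weight.
def dense_layer(x, weights, biases):
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)   # ReLU(w . x + b)
        for row, b in zip(weights, biases)
    ]

weights = [[0.1, -0.2, 0.5], [0.3, 0.0, -0.1]]   # made-up values
biases = [0.0, 0.1]

print(gbt_predict(email))                    # a handful of comparisons
print(dense_layer(email, weights, biases))   # len(weights) * len(email) multiplications
```

The tree ensemble touches only the features on the path it takes, while the dense layer's cost grows with inputs times outputs for every layer, which is the work GPUs are good at parallelizing.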
[01:06:45.000 --> 01:06:48.120]  Actually, speaking of neural networks, I'm kind of curious about this.
[01:06:48.120 --> 01:06:50.440]  How well this would be received at DEF CON.
[01:06:50.440 --> 01:06:54.220]  Would anybody be interested? And please vote in the Discord channel.
[01:06:54.240 --> 01:06:58.400]  If you would be interested in learning how to build neural networks from scratch
[01:06:58.400 --> 01:07:03.180]  or how to build deep learning security tools
[01:07:03.180 --> 01:07:05.920]  I think both would be really interesting.
[01:07:05.920 --> 01:07:09.120]  Neural networks from scratch is actually a lot easier than you would imagine.
[01:07:09.120 --> 01:07:11.480]  You can do it in about 11 lines of Python code.
[01:07:14.640 --> 01:07:18.360]  Today we are able to simplify and easily explain the math
[01:07:18.360 --> 01:07:22.200]  whereas a lot of these algorithms were invented in the 50s
[01:07:23.020 --> 01:07:27.460]  where the people who would do this kind of work were called statisticians
[01:07:27.460 --> 01:07:33.520]  or people who would apply these in business scenarios were called business intelligence.
[01:07:33.960 --> 01:07:37.340]  But today with computer science and these abstraction libraries
[01:07:37.340 --> 01:07:39.560]  it's a lot easier to understand.
[01:07:39.580 --> 01:07:43.760]  So let me know in the Discord chat if there's any interest in either of those
[01:07:43.760 --> 01:07:47.200]  as potentially future workshops for future DEF CON events.
[01:07:47.220 --> 01:07:49.280]  Or even just AI Village events.
[01:07:49.840 --> 01:07:53.020]  So KDAMinMax has a question on Twitch chat.
[01:07:53.020 --> 01:07:54.980]  I'll just read it here.
[01:07:54.980 --> 01:07:58.380]  I have a possibly stupid question about input structures.
[01:07:58.380 --> 01:08:01.860]  If you're learning, there's no real stupid questions.
[01:08:04.720 --> 01:08:08.860]  An email predictor acts on strings or a set of strings.
[01:08:08.860 --> 01:08:12.600]  Are there any machine learning approaches that act reasonably well on trees?
[01:08:12.640 --> 01:08:16.020]  For example, abstract syntax tree representations of applications.
[01:08:16.600 --> 01:08:20.740]  For that, you might want to use a graph convolutional network
[01:08:20.740 --> 01:08:28.340]  or, honestly, you can interpret language as an abstract syntax tree.
[01:08:28.340 --> 01:08:31.020]  So you might want to reach in and use some of the...
[01:08:31.020 --> 01:08:36.160]  There's a huge backlog of literature that people largely ignore
[01:08:36.160 --> 01:08:39.020]  because of the success of deep learning
[01:08:39.020 --> 01:08:44.680]  of how to do language processing using abstract syntax trees.
[01:08:44.900 --> 01:08:48.640]  So you might want to reach into some old school machine learning
[01:08:48.640 --> 01:08:52.660]  from before 2010 when neural networks destroyed the world.
[01:08:53.960 --> 01:08:56.760]  I mean, built our jobs.
[01:08:59.300 --> 01:09:01.420]  That was a good answer. Thank you.
[01:09:04.600 --> 01:09:09.660]  So what we're looking at is logistic regression with TF-IDF vectorizer.
[01:09:09.660 --> 01:09:11.540]  We have a slightly better...
[01:09:11.540 --> 01:09:13.640]  It's a small bump, but it's slightly better
[01:09:13.640 --> 01:09:16.020]  than the multinomial with the count vectorizer.
[01:09:16.020 --> 01:09:20.140]  We have a 95.9% accuracy.
[01:09:20.400 --> 01:09:24.880]  We see that we have a little bit more than the 26 ham emails
[01:09:24.880 --> 01:09:27.280]  that were incorrectly classified as spam.
[01:09:27.280 --> 01:09:31.900]  We have significantly fewer of the spam emails being classified as ham.
[01:09:32.100 --> 01:09:36.260]  So this would be a perfect instance of moving towards
[01:09:36.600 --> 01:09:39.900]  having absolutely zero spam get through our filter
[01:09:39.900 --> 01:09:44.840]  at the risk of losing some of our regular emails.
[01:09:46.020 --> 01:09:49.380]  But what I want to show is how the accuracy improves
[01:09:49.380 --> 01:09:52.880]  just changing the algorithm itself.
[01:09:52.880 --> 01:09:55.700]  And then looking at multinomial Naive Bayesian
[01:09:55.700 --> 01:09:59.640]  compared to logistic regression both using TF-IDF.
[01:09:59.640 --> 01:10:02.320]  We had 87.5% for the multinomial
[01:10:02.600 --> 01:10:06.720]  and then we have 95.9% for the logistic regression.
[01:10:06.720 --> 01:10:09.720]  So far, logistic regression is kind of beating out multinomial
[01:10:09.720 --> 01:10:14.100]  in the TF-IDF and the count metrics.
[01:10:14.100 --> 01:10:16.680]  So we're going to do our last one which is going to be
[01:10:19.980 --> 01:10:23.760]  logistic regression with count vectorizer.
[01:10:23.760 --> 01:10:25.840]  Count vectorizer worked well for multinomial
[01:10:25.840 --> 01:10:28.860]  so let's see if it works well for logistic regression.
[01:10:28.860 --> 01:10:30.960]  So we're going to do LGS count.
[01:10:36.550 --> 01:10:38.070]  And I'm just going to type this out
[01:10:38.070 --> 01:10:41.310]  but you can use the same copy-paste trick that I used for the multinomial
[01:10:41.310 --> 01:10:42.810]  with the count vectorizer.
[01:10:44.670 --> 01:10:48.030]  I'm just doing this as a way to kind of slow myself down
[01:10:48.030 --> 01:10:52.130]  so that you all have an opportunity to copy the code.
[01:10:52.350 --> 01:10:54.150]  So LBFGS.
[01:10:54.570 --> 01:11:03.500]  We're going to fit our labels.
[01:11:03.500 --> 01:11:06.500]  And we got to do our scores, predictions,
[01:11:06.500 --> 01:11:08.940]  confusion matrix, and classification report.
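The steps being typed here (count vectorizer, logistic regression with the LBFGS solver, fit, then scores, predictions, confusion matrix, and classification report) can be sketched with scikit-learn as follows. The six-email dataset and the names `cvec` and `lgs_count` are assumptions for illustration; the workshop notebook uses a real email corpus and may name things differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Tiny made-up dataset; the workshop uses a real email corpus instead.
train_texts = [
    "win free money now", "claim your free prize", "cheap pills click here",
    "meeting moved to friday", "notes from the project call", "lunch tomorrow at noon",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free prize money", "project meeting friday"]
test_labels = ["spam", "ham"]

# Learn the vocabulary from the training text only, then reuse it for the test text.
cvec = CountVectorizer()
X_train = cvec.fit_transform(train_texts)
X_test = cvec.transform(test_texts)

# Fit the classifier on the word counts and their labels.
lgs_count = LogisticRegression(solver="lbfgs")
lgs_count.fit(X_train, train_labels)

# Score the held-out emails.
predictions = lgs_count.predict(X_test)
print("accuracy:", accuracy_score(test_labels, predictions))
print(confusion_matrix(test_labels, predictions, labels=["ham", "spam"]))
print(classification_report(test_labels, predictions))
```

Passing `labels=["ham", "spam"]` to the confusion matrix fixes the row and column order, so the off-diagonal cells can be read unambiguously as ham marked spam and spam marked ham.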
[01:11:18.730 --> 01:11:19.990]  Yeah, one of the things
[01:11:21.090 --> 01:11:23.070]  that you kind of reminded me of, Sven,
[01:11:23.070 --> 01:11:25.150]  when you've been talking about the different trees and algorithms
[01:11:25.150 --> 01:11:28.230]  that have been overtaken by
[01:11:30.490 --> 01:11:32.290]  neural networks and deep neural networks
[01:11:32.290 --> 01:11:35.090]  is random forests. Random forests operate very well
[01:11:35.090 --> 01:11:39.530]  on a lot of different scenarios.
[01:11:40.310 --> 01:11:41.350]  But because
[01:11:41.350 --> 01:11:44.710]  random forests are a cluster of decision trees
[01:11:44.710 --> 01:11:46.810]  people go, oh, they're just if-then statements.
[01:11:46.810 --> 01:11:48.870]  This is dumb. This isn't machine learning.
[01:11:49.670 --> 01:11:52.970]  Yeah, but for us, we have
[01:11:53.050 --> 01:11:55.810]  a highly shared data where,
[01:11:55.810 --> 01:11:58.870]  for example, on the Ember malware dataset,
[01:11:59.470 --> 01:12:02.330]  the lightweight,
[01:12:02.330 --> 01:12:03.250]  incredibly cheap
[01:12:05.090 --> 01:12:07.970]  gradient-boosted decision tree built with LightGBM
[01:12:07.970 --> 01:12:11.190]  outperformed any neural network you can throw at it.
[01:12:11.190 --> 01:12:13.910]  So the LightGBM gets 99.8%
[01:12:13.910 --> 01:12:16.970]  accuracy or something like that, and the neural network gets
[01:12:16.970 --> 01:12:19.910]  98.7% accuracy, which is
[01:12:19.910 --> 01:12:21.870]  significantly worse.
[01:12:24.350 --> 01:12:26.230]  For some tasks,
[01:12:26.410 --> 01:12:29.650]  a gradient-boosted decision tree or something like that
[01:12:29.650 --> 01:12:32.170]  is the correct thing mathematically
[01:12:32.590 --> 01:12:34.290]  because neural networks have biases
[01:12:34.970 --> 01:12:38.070]  because it's easier for them to learn some things than others
[01:12:38.070 --> 01:12:41.050]  and GBTs have biases where
[01:12:41.050 --> 01:12:43.930]  it's easier for them to learn some things than others.
[01:12:43.930 --> 01:12:46.550]  And one of the great secrets
[01:12:46.550 --> 01:12:48.510]  of machine learning is
[01:12:49.190 --> 01:12:52.290]  match up your problem with both
[01:12:52.640 --> 01:12:55.850]  the requirements, how powerful your predictor
[01:12:55.850 --> 01:12:58.010]  needs to be, and the inherent biases
[01:12:58.010 --> 01:13:00.810]  of those predictors.
[01:13:02.130 --> 01:13:02.910]  Yeah.
[01:13:06.480 --> 01:13:09.300]  Yeah, that's pretty interesting. I didn't know that.
[01:13:10.700 --> 01:13:12.780]  But that also reminds me of another thing
[01:13:12.780 --> 01:13:16.720]  with neural networks. Neural networks are really good at learning superstitions.
[01:13:17.060 --> 01:13:19.520]  I don't know if that's a machine learning as a whole thing
[01:13:19.520 --> 01:13:21.780]  or specifically in neural networks
[01:13:21.780 --> 01:13:24.960]  but I remember there was a dataset that they were using to identify
[01:13:24.960 --> 01:13:28.040]  skin cancer cells. They used a convolutional
[01:13:28.040 --> 01:13:31.180]  neural network and based off of that it was learning
[01:13:31.180 --> 01:13:34.640]  that if there was a... it was performing too well.
[01:13:34.640 --> 01:13:36.960]  It got nearly 100% and they
[01:13:37.540 --> 01:13:40.140]  kind of dug into what it was doing
[01:13:40.140 --> 01:13:42.800]  and it said, OK, well if I see a ruler in the picture
[01:13:42.800 --> 01:13:45.780]  which the scientists, how they were collecting the images, they had a ruler
[01:13:45.780 --> 01:13:48.860]  next to the skin cancer cells. If I see a ruler in the picture
[01:13:48.860 --> 01:13:52.380]  that's how I know it's cancer. Not the little blob
[01:13:52.380 --> 01:13:53.480]  next to it.
[01:13:55.100 --> 01:13:57.640]  So I don't know if that's a whole machine learning problem
[01:13:57.640 --> 01:13:59.140]  or specifically a neural net problem
[01:13:59.140 --> 01:14:03.140]  but it's just kind of an interesting anecdote.
[01:14:04.060 --> 01:14:07.740]  Be careful how you collect and preprocess and use your data.
[01:14:09.060 --> 01:14:11.640]  A friend of mine did his PhD.
[01:14:11.640 --> 01:14:14.340]  His PhD thesis was inducing the
[01:14:15.220 --> 01:14:17.560]  bias that's sort of like the inherent
[01:14:18.360 --> 01:14:21.140]  model bias that's in convolutional
[01:14:21.140 --> 01:14:23.980]  neural networks into other models
[01:14:23.980 --> 01:14:26.140]  and it slightly improved.
[01:14:29.200 --> 01:14:31.620]  By inducing a bias?
[01:14:31.840 --> 01:14:35.340]  Yeah, you can induce a pixel
[01:14:35.340 --> 01:14:38.100]  locality bias in something like a gradient-boosted
[01:14:38.100 --> 01:14:42.140]  tree or a support vector machine
[01:14:43.260 --> 01:14:44.780]  and it improves
[01:14:45.700 --> 01:14:47.740]  that's one of the assumptions that people make
[01:14:47.740 --> 01:14:50.920]  about convolutional neural networks is it allows you to make
[01:14:50.920 --> 01:14:53.900]  small pixel local decisions
[01:14:54.880 --> 01:14:56.880]  and then those can get glued together
[01:14:56.880 --> 01:14:58.500]  in the right way.
[01:15:00.160 --> 01:15:02.900]  And so he went and built little modifications
[01:15:02.900 --> 01:15:05.580]  to a dozen other algorithms
[01:15:05.580 --> 01:15:07.260]  that have sort of fallen out of favor
[01:15:07.260 --> 01:15:11.400]  and all of them improved slightly on images
[01:15:11.400 --> 01:15:14.500]  when you induced a pixel local bias.
[01:15:14.980 --> 01:15:17.160]  Okay, interesting.
[01:15:17.160 --> 01:15:20.200]  Yeah, that's
[01:15:23.140 --> 01:15:23.920]  that's actually
[01:15:23.920 --> 01:15:24.900]  really interesting.
[01:15:27.480 --> 01:15:28.920]  Alright, so
[01:15:29.520 --> 01:15:32.620]  hopefully I gave you guys enough time to make sure that this got
[01:15:32.620 --> 01:15:34.980]  copied down. I'm going to go ahead and run it
[01:15:34.980 --> 01:15:38.220]  and shift enter. This is going to take a little bit because it's going through
[01:15:38.220 --> 01:15:40.320]  its matrix multiplications
[01:15:41.200 --> 01:15:44.100]  but in the meantime, we're about to hop into
[01:15:44.620 --> 01:15:46.780]  our final task
[01:15:46.780 --> 01:15:49.960]  and what we have is a real spam email
[01:15:49.960 --> 01:15:52.580]  I pulled this out of my inbox or my spam folder
[01:15:54.220 --> 01:15:55.100]  last May
[01:15:55.860 --> 01:15:58.160]  but we see your latest issue is available now
[01:15:58.160 --> 01:16:00.940]  if you don't want issue notifications, click here to unsubscribe
[01:16:01.660 --> 01:16:03.840]  Hi George. My name is not George
[01:16:05.180 --> 01:16:07.400]  but we're going to look at this email
[01:16:07.400 --> 01:16:10.260]  and we're going to try it against all four of our spam classifiers
[01:16:10.260 --> 01:16:13.360]  and see how well they do. But what's also good about this exercise
[01:16:13.360 --> 01:16:16.400]  is it doesn't really hold your hand. This is usually something
[01:16:16.400 --> 01:16:19.500]  that I give plenty of time in a classroom
[01:16:19.500 --> 01:16:21.480]  setting. This allows people to ask questions
[01:16:22.100 --> 01:16:24.860]  as I go through. But this last task is going to say
[01:16:24.860 --> 01:16:27.360]  OK, we have a new email we've never seen before
[01:16:29.540 --> 01:16:31.160]  run it against our classifiers
[01:16:31.160 --> 01:16:33.960]  that we trained. And so you have to go through
[01:16:33.960 --> 01:16:37.080]  the whole process of using the
[01:16:37.080 --> 01:16:40.000]  vectorizers to vectorize the tokens
[01:16:40.000 --> 01:16:42.940]  or the words in the email and then putting them through the predict
[01:16:42.940 --> 01:16:45.880]  function and then printing out the prediction
[01:16:45.880 --> 01:16:48.960]  that it gives. So before
[01:16:48.960 --> 01:16:51.320]  we hop into that, I really want this to finish
[01:16:51.320 --> 01:16:54.620]  I want it to just be on its own time
[01:16:55.780 --> 01:16:56.380]  but
[01:16:59.600 --> 01:17:01.240]  we're going to
[01:17:01.240 --> 01:17:03.860]  compare how well these models did
[01:17:05.040 --> 01:17:06.760]  once this last one is complete
[01:17:06.760 --> 01:17:09.520]  and we're going to see if or how much
[01:17:09.520 --> 01:17:12.660]  that comparison has an influence on our test email
[01:17:15.180 --> 01:17:17.460]  So we're just going to give this another
[01:17:18.580 --> 01:17:21.180]  Probably be another 30 seconds or so.
[01:17:21.180 --> 01:17:22.860]  So it's not too long.
[01:17:22.860 --> 01:17:24.000]  How's the Discord chat?
[01:17:27.820 --> 01:17:29.460]  Discord chat's doing well.
[01:17:36.560 --> 01:17:40.940]  I really appreciate how much discussion we have in this version of the workshop.
[01:17:41.180 --> 01:17:45.000]  Also, and this is something that I want to note for people who are seeing this for the first time.
[01:17:45.000 --> 01:17:47.820]  So the last time I did my workshop on Friday,
[01:17:47.820 --> 01:17:51.280]  unfortunately my webcam froze in a very unflattering position,
[01:17:51.280 --> 01:17:53.140]  so I just had my mouth gaping open.
[01:17:54.540 --> 01:17:59.600]  So for this one I just put a static image of myself today.
[01:17:59.940 --> 01:18:01.740]  That way you can still see my beautiful face,
[01:18:01.740 --> 01:18:04.940]  but you don't get a very unflattering image of it.
[01:18:05.780 --> 01:18:08.320]  Alright, so this one is complete.
[01:18:08.320 --> 01:18:13.780]  We are now at the best model that we have of the ones that we've trained.
[01:18:13.780 --> 01:18:18.200]  We have a 97.69% accuracy.
[01:18:18.200 --> 01:18:21.100]  We can round that up to a 97.7% accuracy.
[01:18:21.100 --> 01:18:28.180]  We see that only 8 out of our 582 ham emails were misclassified as spam,
[01:18:28.180 --> 01:18:34.720]  and then only 12 out of our 264 spam emails were misclassified as ham,
[01:18:34.720 --> 01:18:39.680]  making this the best model that we have out of the 4 that we've trained.
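The numbers read off the confusion matrix can be turned into the usual metrics with a little arithmetic. This sketch uses only the four counts quoted above; with those counts the accuracy works out to roughly 97.6%, slightly under the 97.69% read out, so one of the spoken counts may be off by a few emails.

```python
# Counts read from the confusion matrix described above.
ham_total, ham_as_spam = 582, 8      # ham wrongly flagged as spam (false positives)
spam_total, spam_as_ham = 264, 12    # spam that slipped through (false negatives)

ham_correct = ham_total - ham_as_spam      # true negatives
spam_correct = spam_total - spam_as_ham    # true positives
total = ham_total + spam_total

accuracy = (ham_correct + spam_correct) / total
spam_recall = spam_correct / spam_total                        # share of spam caught
spam_precision = spam_correct / (spam_correct + ham_as_spam)   # share of spam calls that were right
false_positive_rate = ham_as_spam / ham_total                  # share of ham lost to the filter

print(f"accuracy:            {accuracy:.4f}")
print(f"spam recall:         {spam_recall:.4f}")
print(f"spam precision:      {spam_precision:.4f}")
print(f"false positive rate: {false_positive_rate:.4f}")
```

Accuracy alone hides the trade-off discussed earlier: recall says how much spam gets through, while the false positive rate says how much legitimate mail is lost.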
[01:18:41.400 --> 01:18:43.980]  So that's pretty solid.
[01:18:44.960 --> 01:18:49.860]  So moving forward, what I would do is take this as kind of the better of the models,
[01:18:49.860 --> 01:18:53.140]  and I would go ahead and deploy this in production,
[01:18:53.140 --> 01:19:00.440]  or add this model into an application that's designed to kind of wrap around the idea of spam and ham.
[01:19:00.960 --> 01:19:05.520]  So if you are simply an email provider, you can use this to filter email,
[01:19:05.520 --> 01:19:13.300]  or if you are a spam company, or not a spam company, but like a spam research company,
[01:19:13.300 --> 01:19:20.920]  you would use this to grab the spam messages and then collect that as data,
[01:19:20.920 --> 01:19:25.540]  and then you can use that to be training the next version of your spam filter.
[01:19:27.680 --> 01:19:31.580]  So let's take a look at our test spam email.
[01:19:32.180 --> 01:19:36.900]  We have our final task, which is to use the vectorizer and the models that we created
[01:19:36.900 --> 01:19:42.360]  to perform predictions on our test email and our test spam email.
[01:19:42.360 --> 01:19:49.500]  So to make things easier, what I'm going to do is use a variable, I'll call this working email,
[01:19:50.340 --> 01:19:53.420]  and this will allow us to switch pretty quickly.
[01:19:54.180 --> 01:20:00.620]  So I have test email, and then we're going to want to vectorize our email,
[01:20:00.620 --> 01:20:04.960]  so we're going to use our test email tfidf,
[01:20:07.640 --> 01:20:11.620]  yes, tvec.transform,
[01:20:12.520 --> 01:20:15.940]  and this is going to be, remember it needs to be in a list,
[01:20:15.940 --> 01:20:19.620]  so we've got to use these square brackets, our working email,
[01:20:20.280 --> 01:20:27.660]  and then we'll do the same thing with our test email count, tvec.transform,
[01:20:30.760 --> 01:20:32.180]  working email.
[01:20:33.180 --> 01:20:36.520]  Now we need to get the raw predictions for every single one of them.
[01:20:36.520 --> 01:20:40.880]  The raw predictions, we already have the variable names here, kind of at the bottom,
[01:20:40.880 --> 01:20:42.960]  so I'm just going to go ahead and write them out.
[01:20:42.960 --> 01:20:46.660]  So we have test email mnb with tfidf,
[01:20:47.840 --> 01:20:52.600]  which is going to be our mnbtfidf.predict,
[01:20:52.600 --> 01:20:59.840]  on our test email with the tfidf vectorized info,
[01:21:00.620 --> 01:21:04.520]  we're going to do the same thing for count,
[01:21:04.520 --> 01:21:06.800]  using our count.predict,
[01:21:14.490 --> 01:21:17.830]  with our count vectorized,
[01:21:18.390 --> 01:21:23.850]  test email lgs, tfidf.predict,
[01:21:32.020 --> 01:21:35.220]  tfidf.data, and our last one,
[01:21:35.220 --> 01:21:38.480]  test email lgs.count,
[01:21:44.520 --> 01:21:45.560]  count.
[01:21:47.040 --> 01:21:49.560]  And then, just for fun and games,
[01:21:49.560 --> 01:21:52.800]  we can print out our working email.
[01:21:53.020 --> 01:21:55.240]  So this is going to print our test email.
[01:21:55.240 --> 01:21:57.400]  Our test email was the one that we saw,
[01:21:57.400 --> 01:21:59.940]  EastAsianFonts and Lenny, thanks for your support.
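The final task boils down to three steps per model: wrap the email in a list, transform it with the same vectorizer the model was trained on, and call predict. Here is a self-contained sketch with scikit-learn. The training texts, both example emails, and the names `tvec`, `cvec`, and the model keys are stand-ins chosen for illustration, not the workshop's actual data or variable names.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Stand-in training data (made up); in the workshop the models are already trained.
texts = [
    "win free money now", "claim your free prize", "click here to unsubscribe",
    "meeting moved to friday", "thanks for your support", "notes from the call",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

tvec, cvec = TfidfVectorizer(), CountVectorizer()
X_tfidf, X_count = tvec.fit_transform(texts), cvec.fit_transform(texts)

# Each model is paired with the vectorizer it was trained on.
models = {
    "mnb_tfidf": (MultinomialNB().fit(X_tfidf, labels), tvec),
    "mnb_count": (MultinomialNB().fit(X_count, labels), cvec),
    "lgs_tfidf": (LogisticRegression(solver="lbfgs").fit(X_tfidf, labels), tvec),
    "lgs_count": (LogisticRegression(solver="lbfgs").fit(X_count, labels), cvec),
}

test_email = "thanks for your support on friday"
test_spam_email = "click here to claim your free money"

working_email = test_spam_email   # swap to test_email to check the other message

results = {}
for name, (model, vectorizer) in models.items():
    features = vectorizer.transform([working_email])   # must be a list of documents
    results[name] = model.predict(features)[0]
    print(name, "->", results[name])
```

Note that only `transform` is called on the new email, never `fit_transform`; refitting the vectorizer would change the vocabulary and the columns would no longer line up with what the model learned.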
[01:22:02.500 --> 01:22:06.640]  So you can go ahead and copy this down.
[01:22:06.640 --> 01:22:09.220]  I'm going to go ahead and run this,
[01:22:09.220 --> 01:22:11.780]  because we are running close to time.
[01:22:11.780 --> 01:22:13.820]  So this is going to be our test email,
[01:22:13.820 --> 01:22:15.840]  and ideally, this should say that
[01:22:15.840 --> 01:22:18.140]  it is a regular email, not spam.
[01:22:18.140 --> 01:22:21.100]  It's going to label all four of these as ham.
[01:22:21.120 --> 01:22:22.620]  So let's run it.
[01:22:22.920 --> 01:22:24.780]  Working is not defined,
[01:22:26.540 --> 01:22:29.540]  because I forgot to call this working email.
[01:22:30.200 --> 01:22:31.560]  So we're going to run it.
[01:22:32.280 --> 01:22:33.720]  Here's our email.
[01:22:34.380 --> 01:22:36.140]  And all four of our classifiers
[01:22:36.140 --> 01:22:39.500]  correctly classified this as ham.
[01:22:40.360 --> 01:22:42.220]  I'm going to scroll back up,
[01:22:42.220 --> 01:22:48.280]  and we saved our email as our test spam email.
[01:22:48.280 --> 01:22:50.780]  So I'm going to grab that,
[01:22:50.780 --> 01:22:52.520]  and I'll run it.
[01:22:53.260 --> 01:22:55.520]  Oh, I forgot to run the code block
[01:22:55.520 --> 01:22:58.200]  that actually defined our test spam email.
[01:22:59.120 --> 01:23:00.960]  And now we can run this.
[01:23:01.560 --> 01:23:03.060]  And we see something interesting here.
[01:23:03.060 --> 01:23:04.760]  Our worst performing model,
[01:23:04.760 --> 01:23:06.720]  our MNB TF-IDF,
[01:23:06.720 --> 01:23:09.480]  which was 87.5% accurate,
[01:23:10.520 --> 01:23:12.900]  misclassified this email as being legitimate,
[01:23:12.900 --> 01:23:14.000]  as a ham email.
[01:23:14.240 --> 01:23:15.600]  But the other three of ours
[01:23:15.600 --> 01:23:18.100]  classified this correctly as a spam email.
[01:23:18.740 --> 01:23:20.700]  So there's a couple things we can do with this.
[01:23:20.700 --> 01:23:22.380]  If we were so inclined,
[01:23:22.380 --> 01:23:24.540]  we can use what's called ensemble learning,
[01:23:24.540 --> 01:23:27.380]  where we use all four algorithms.
[01:23:27.380 --> 01:23:29.120]  We take the average of them,
[01:23:29.120 --> 01:23:31.240]  or in other words,
[01:23:31.240 --> 01:23:34.820]  they give a vote on what they think the class is.
[01:23:34.820 --> 01:23:36.420]  And then in this case,
[01:23:36.420 --> 01:23:37.820]  three out of the four say it's spam,
[01:23:37.820 --> 01:23:38.740]  so it would be spam.
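The voting idea can be sketched in a few lines of plain Python. The model names mirror the four classifiers discussed, and the tie-breaking rule is a hypothetical policy choice, since four voters can split two against two.

```python
from collections import Counter

def majority_vote(predictions, tie_breaker="spam"):
    """Return the label most models agree on.

    tie_breaker is a hypothetical policy for an even split (for example 2 vs 2).
    """
    ranked = Counter(predictions.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_breaker
    return ranked[0][0]

# The outcome described above: three models say spam, one says ham.
votes = {
    "mnb_tfidf": "ham",
    "mnb_count": "spam",
    "lgs_tfidf": "spam",
    "lgs_count": "spam",
}
print(majority_vote(votes))
```

Defaulting ties to spam favors a stricter filter; a provider more worried about losing legitimate mail could default to ham instead.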
[01:23:38.740 --> 01:23:42.240]  Or we can use this as an example and say,
[01:23:42.240 --> 01:23:44.420]  hey, the MNB TF-IDF,
[01:23:44.420 --> 01:23:45.680]  we probably shouldn't use that
[01:23:45.680 --> 01:23:47.780]  simply because it's misclassifying
[01:23:47.780 --> 01:23:49.680]  known spam emails
[01:23:50.420 --> 01:23:52.340]  at a high rate.
[01:23:52.340 --> 01:23:55.180]  87.5% accuracy means that we have
[01:23:55.180 --> 01:23:58.840]  about a 12.5%
[01:24:00.640 --> 01:24:03.760]  chance of misclassifying any given email.
[01:24:05.020 --> 01:24:06.740]  So this is kind of
[01:24:06.740 --> 01:24:08.960]  pretty close to the conclusion of the workshop.
[01:24:08.960 --> 01:24:10.380]  This shows how you can take
[01:24:10.380 --> 01:24:12.540]  raw emails, raw data,
[01:24:12.540 --> 01:24:14.100]  transform it in a way
[01:24:14.100 --> 01:24:15.300]  that can be used in
[01:24:15.780 --> 01:24:17.520]  machine learning models.
[01:24:17.720 --> 01:24:19.800]  And then we build out our machine learning models,
[01:24:19.800 --> 01:24:20.880]  we evaluate them,
[01:24:20.880 --> 01:24:22.660]  and we can test them.
[01:24:22.660 --> 01:24:24.020]  And this last task shows you
[01:24:24.020 --> 01:24:26.220]  how we can properly deploy them.
[01:24:26.220 --> 01:24:28.080]  So we can build a piece of software
[01:24:28.080 --> 01:24:29.240]  or an application around it
[01:24:29.240 --> 01:24:30.820]  and say every time we get an email,
[01:24:32.180 --> 01:24:33.500]  just swap out
[01:24:33.500 --> 01:24:34.960]  this working email,
[01:24:34.960 --> 01:24:37.580]  and then you can leave everything else the same.
[01:24:38.800 --> 01:24:39.960]  Now, before I let you go,
[01:24:39.960 --> 01:24:41.900]  there's one more thing that I want to show you.
[01:24:42.260 --> 01:24:44.440]  I mentioned hyperparameter tuning.
[01:24:44.440 --> 01:24:46.000]  So let's take a look and see
[01:24:46.000 --> 01:24:48.620]  what hyperparameter tuning can do.
[01:24:48.980 --> 01:24:50.060]  So when we were talking about
[01:24:50.060 --> 01:24:51.720]  the multinomial Naive Bayesian,
[01:24:51.720 --> 01:24:53.560]  we saw that we had that smoothing function,
[01:24:53.560 --> 01:24:55.620]  the alpha, right?
[01:24:56.000 --> 01:24:58.280]  And it was normally 1.
[01:24:58.580 --> 01:25:00.980]  So we can add that in here.
[01:25:00.980 --> 01:25:03.360]  So we have alpha equals 1.
[01:25:03.360 --> 01:25:07.900]  We see that's an 87.5% accuracy.
[01:25:08.280 --> 01:25:09.720]  We go ahead and run that.
[01:25:09.720 --> 01:25:10.800]  I'm just doing this to show you
[01:25:10.800 --> 01:25:12.260]  that there are no tricks up my sleeve.
[01:25:12.260 --> 01:25:15.460]  Alpha is by default 1 for this model.
[01:25:16.240 --> 01:25:17.740]  But what if we change that
[01:25:17.740 --> 01:25:20.060]  to maybe something a little smaller, right?
[01:25:20.060 --> 01:25:21.480]  Maybe a 0.1.
[01:25:21.480 --> 01:25:24.200]  Let's reduce the impact of that smoothing filter.
[01:25:24.200 --> 01:25:25.280]  So I'm going to go ahead
[01:25:25.280 --> 01:25:28.200]  and press Shift-Enter and run that.
[01:25:28.860 --> 01:25:30.300]  And we see something interesting.
[01:25:30.300 --> 01:25:34.160]  So this went from our worst model,
[01:25:34.160 --> 01:25:36.340]  87.5% accuracy,
[01:25:36.340 --> 01:25:40.140]  to actually our second best model,
[01:25:40.140 --> 01:25:43.060]  96.5% accuracy.
[01:25:43.740 --> 01:25:47.300]  And so it slightly misclassified
[01:25:48.380 --> 01:25:50.720]  some of the legitimate ham emails as spam.
[01:25:50.720 --> 01:25:52.780]  But it drastically reduced the,
[01:25:52.780 --> 01:25:56.300]  I think it was 108 spam emails
[01:25:57.160 --> 01:26:00.260]  being misclassified down to just 22.
[01:26:01.740 --> 01:26:04.240]  So this is the power of hyperparameter tuning.
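One way to see what alpha does is to compute the smoothed word probability by hand, using the additive smoothing formula (count + alpha) / (class total + alpha * vocabulary size). The numbers below are invented for illustration, and whether this effect fully explains the jump from 87.5% to 96.5% in the workshop is an assumption, not something the notebook output proves.

```python
def smoothed_probability(word_count, class_total, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Made-up numbers: with TF-IDF, per-word "counts" are small fractional weights.
VOCAB_SIZE = 10_000
CLASS_TOTAL = 50.0    # summed TF-IDF weight of all words in the spam class
SEEN_WEIGHT = 2.0     # summed TF-IDF weight of one strongly spammy word
UNSEEN_WEIGHT = 0.0   # a word never seen in spam

def evidence_ratio(alpha):
    """How much more likely the seen word is than an unseen one under this alpha."""
    seen = smoothed_probability(SEEN_WEIGHT, CLASS_TOTAL, VOCAB_SIZE, alpha)
    unseen = smoothed_probability(UNSEEN_WEIGHT, CLASS_TOTAL, VOCAB_SIZE, alpha)
    return seen / unseen

for alpha in (1.0, 0.1):
    print(f"alpha={alpha}: seen/unseen ratio = {evidence_ratio(alpha):.1f}")
```

With alpha at 1 the pseudo-count is about as large as the real evidence, so a telltale word barely stands out; at 0.1 the same word stands out several times more. In scikit-learn the value is set through the `alpha` argument of `MultinomialNB`.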
[01:26:04.240 --> 01:26:05.400]  And if we go all the way back down
[01:26:05.400 --> 01:26:07.720]  to our final task, task 7,
[01:26:07.720 --> 01:26:09.640]  and we run this,
[01:26:09.640 --> 01:26:12.640]  where the multinomial Naive Bayesian with TF-IDF
[01:26:13.240 --> 01:26:14.840]  incorrectly classified that as ham,
[01:26:14.840 --> 01:26:16.440]  we go ahead and rerun it.
[01:26:16.760 --> 01:26:18.860]  Now it's correctly classifying it as spam
[01:26:18.860 --> 01:26:20.500]  because its accuracy is much better.
[01:26:20.500 --> 01:26:22.040]  And the only thing that we did to change it
[01:26:22.040 --> 01:26:23.840]  was that hyperparameter tuning.
[01:26:26.920 --> 01:26:29.060]  So that's all I have for you guys today.
[01:26:29.060 --> 01:26:32.440]  I have a couple good resources,
[01:26:32.440 --> 01:26:34.320]  but the biggest one that's going to be helpful
[01:26:35.160 --> 01:26:36.760]  is if you have any questions,
[01:26:36.760 --> 01:26:37.720]  feel free to reach out to me.
[01:26:37.720 --> 01:26:39.540]  I am gtklondike on Gmail.
[01:26:39.540 --> 01:26:41.760]  I'm also gtklondike on Twitter.
[01:26:42.020 --> 01:26:43.400]  I'm not as active on Twitter,
[01:26:43.400 --> 01:26:44.780]  but I will see your direct messages
[01:26:44.780 --> 01:26:47.540]  or any at mentions.
[01:26:48.340 --> 01:26:50.120]  And then the GitHub code
[01:26:50.920 --> 01:26:52.360]  for all of these workbooks,
[01:26:52.360 --> 01:26:54.060]  this and all the workbooks,
[01:26:54.060 --> 01:26:57.000]  are on github.com/netsecexplained.
[01:26:59.060 --> 01:27:01.540]  So that is all I have for you guys.
[01:27:01.540 --> 01:27:04.660]  It looks like we just barely made it.
[01:27:04.780 --> 01:27:05.740]  12:31.
[01:27:06.960 --> 01:27:09.160]  And I will let you go.
[01:27:09.160 --> 01:27:12.420]  I will stay in the chat channel
[01:27:12.420 --> 01:27:13.240]  for a little bit
[01:27:13.240 --> 01:27:16.140]  in case any of you would like to continue to discuss.
[01:27:16.140 --> 01:27:17.480]  And I'll also stay in the voice channel
[01:27:17.480 --> 01:27:20.100]  for a little bit for the same reason.
[01:27:20.220 --> 01:27:22.480]  So I will see you there.
[01:27:22.720 --> 01:27:23.900]  Have a good day.
